(54) Estimation of excitation parameters

(57) Excitation parameters for a digitized speech signal are determined by analysing the digitized speech
signal. The digitized speech signal is divided into at least two frequency bands. A first preliminary excitation
parameter is determined by performing a nonlinear operation on at least one of the frequency band signals to
produce a modified frequency band signal and determining the first preliminary excitation parameter using
the modified frequency band signal. A second preliminary excitation parameter is determined using a method
different from the first method. The first and second preliminary excitation parameters are used to determine an
excitation parameter for the digitized speech signal. The method is useful in encoding speech. Speech
synthesized using the parameters estimated based on the invention generates high quality speech at various bit
rates useful for applications such as satellite voice communication.
Description

The invention has arisen from work seeking to improve the accuracy with which excitation parameters are estimated
in speech analysis and synthesis.

Speech analysis and synthesis are widely used in applications such as telecommunications and voice recognition.
A vocoder, which is a type of speech analysis/synthesis system, models speech as the response of a system to
excitation over short time intervals. Examples of vocoder systems include linear prediction vocoders, homomorphic
vocoders, channel vocoders, sinusoidal transform coders ("STC"), multiband excitation ("MBE") vocoders, and improved
multiband excitation ("IMBE (TM)") vocoders.

Vocoders typically synthesize speech based on excitation parameters and system parameters. Typically, an input
signal is segmented using, for example, a Hamming window. Then, for each segment, system parameters and excitation
parameters are determined. System parameters include the spectral envelope or the impulse response of the system.
Excitation parameters include a fundamental frequency (or pitch) and a voiced/unvoiced parameter that indicates
whether the input signal has pitch (or indicates the degree to which the input signal has pitch). In vocoders that divide
the speech into frequency bands, such as IMBE (TM) vocoders, the excitation parameters may also include a
voiced/unvoiced parameter for each frequency band rather than a single voiced/unvoiced parameter. Accurate excitation
parameters are essential for high quality speech synthesis.

When the voiced/unvoiced parameters include only a single voiced/unvoiced decision for the entire frequency
band, the synthesized speech tends to have a "buzzy" quality especially noticeable in regions of speech which contain
mixed voicing or in voiced regions of noisy speech. A number of mixed excitation models have been proposed as
potential solutions to the problem of "buzziness" in vocoders. In these models, periodic and noise-like excitations
having either time-invariant or time-varying spectral shapes are mixed.

In excitation models having time-invariant spectral shapes, the excitation signal consists of the sum of a periodic
source and a noise source with fixed spectral envelopes. The mixture ratio controls the relative amplitudes of the
periodic and noise sources. Examples of such models include Itakura and Saito, "Analysis Synthesis Telephony Based
upon the Maximum Likelihood Method," Reports of 6th Int. Cong. Acoust., Tokyo, Japan, Paper C-5-5, pp. C17-20,
1968; and Kwon and Goldberg, "An Enhanced LPC Vocoder with No Voiced/Unvoiced Switch," IEEE Trans. on Acoust.,
Speech, and Signal Processing, vol. ASSP-32, no. 4, pp. 851-858, August 1984. In these excitation models a white
noise source is added to a white periodic source. The mixture ratio between these sources is estimated from the height
of the peak of the autocorrelation of the LPC residual.

In excitation models having time-varying spectral shapes, the excitation signal consists of the sum of a periodic
source and a noise source with time varying spectral envelope shapes. Examples of such models include Fujimara,
"An Approximation to Voice Aperiodicity," IEEE Trans. Audio and Electroacoust., pp. 68-72, March 1968; Makhoul et
al., "A Mixed-Source Excitation Model for Speech Compression and Synthesis," IEEE Int. Conf. on Acoust. Sp. & Sig.
Proc., April 1978, pp. 163-166; Kwon and Goldberg, "An Enhanced LPC Vocoder with No Voiced/Unvoiced Switch,"
IEEE Trans. on Acoust., Speech, and Signal Processing, vol. ASSP-32, no. 4, pp. 851-858, August 1984; and Griffin
and Lim, "Multiband Excitation Vocoder," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-36,
pp. 1223-1235, Aug. 1988.

In the excitation model proposed by Fujimara, the excitation spectrum is divided into three fixed frequency bands.
A separate cepstral analysis is performed for each frequency band and a voiced/unvoiced decision for each frequency
band is made based on the height of the cepstrum peak as a measure of periodicity.

In the excitation model proposed by Makhoul et al., the excitation signal consists of the sum of a low-pass periodic 
source and a high-pass noise source. The low-pass periodic source is generated by filtering a white pulse source with 
a variable cut-off low-pass filter. Similarly, the high-pass noise source is generated by filtering a white noise source
with a variable cut-off high-pass filter. The cut-off frequencies for the two filters are equal and are estimated by choosing 
the highest frequency at which the spectrum is periodic. Periodicity of the spectrum is determined by examining the 
separation between consecutive peaks and determining whether the separations are the same, within some tolerance 
level. 

In a second excitation model implemented by Kwon and Goldberg, a pulse source is passed through a variable
gain low-pass filter and added to itself, and a white noise source is passed through a variable gain high-pass filter and
added to itself. The excitation signal is the sum of the resultant pulse and noise sources with the relative amplitudes
controlled by a voiced/unvoiced mixture ratio. The filter gains and voiced/unvoiced mixture ratio are estimated from
the LPC residual signal with the constraint that the spectral envelope of the resultant excitation signal is flat.

In the multiband excitation model proposed by Griffin and Lim, a frequency dependent voiced/unvoiced mixture
function is proposed. This model is restricted to a frequency dependent binary voiced/unvoiced decision for coding
purposes. A further restriction of this model divides the spectrum into a finite number of frequency bands with a binary
voiced/unvoiced decision for each band. The voiced/unvoiced information is estimated by comparing the speech
spectrum to the closest periodic spectrum. When the error is below a threshold, the band is marked voiced; otherwise, the






EP0 722 165 A2 



band is marked unvoiced. 

Excitation parameters may also be used in applications, such as speech recognition, where no speech synthesis 
is required. Once again, the accuracy of the excitation parameters directly affects the performance of such a system. 

In one aspect, generally, the invention features a hybrid excitation parameter estimation technique that produces
two sets of excitation parameters for a speech signal using two different approaches and combines the two sets to
produce a single set of excitation parameters. In a first approach, the technique applies a nonlinear operation to the
speech signal to emphasize the fundamental frequency of the speech signal. In a second approach, we use a different
method which may or may not include a nonlinear operation. While the first approach produces highly accurate
excitation parameters under most conditions, the second approach produces more accurate parameters under certain
conditions. By using both approaches and combining the resulting sets of excitation parameters to produce a single
set, our technique produces accurate results under a wider range of conditions than are produced by either of the
approaches individually.

In typical approaches to determining excitation parameters, an analog speech signal s(t) is sampled to produce a
speech signal s(n). Speech signal s(n) is then multiplied by a window w(n) to produce a windowed signal $s_w(n)$ that
is commonly referred to as a speech segment or a speech frame. A Fourier transform is then performed on windowed
signal $s_w(n)$ to produce a frequency spectrum $S_w(\omega)$ from which the excitation parameters are determined.

When speech signal s(n) is periodic with a fundamental frequency $\omega_0$ or pitch period $n_0$ (where $n_0$ equals
$2\pi/\omega_0$), the frequency spectrum of speech signal s(n) should be a line spectrum with energy at $\omega_0$ and
harmonics thereof (integral multiples of $\omega_0$). As expected, $S_w(\omega)$ has spectral peaks that are centered
around $\omega_0$ and its harmonics. However, due to the windowing operation, the spectral peaks include some width,
where the width depends on the length and shape of window w(n) and tends to decrease as the length of window w(n)
increases. This window-induced error reduces the accuracy of the excitation parameters. Thus, to decrease the width
of the spectral peaks, and to thereby increase the accuracy of the excitation parameters, the length of window w(n)
should be made as long as possible.

The maximum useful length of window w(n) is limited. Speech signals are not stationary signals, and instead have
fundamental frequencies that change over time. To obtain meaningful excitation parameters, an analyzed speech
segment must have a substantially unchanged fundamental frequency. Thus, the length of window w(n) must be short
enough to ensure that the fundamental frequency will not change significantly within the window.

In addition to limiting the maximum length of window w(n), a changing fundamental frequency tends to broaden
the spectral peaks. This broadening effect increases with increasing frequency. For example, if the fundamental
frequency changes by $\Delta\omega_0$ during the window, the frequency of the mth harmonic, which has a frequency of
$m\omega_0$, changes by $m\Delta\omega_0$ so that the spectral peak corresponding to $m\omega_0$ is broadened more
than the spectral peak corresponding to $\omega_0$. This increased broadening of the higher harmonics reduces the
effectiveness of higher harmonics in the estimation of the fundamental frequency and the generation of
voiced/unvoiced parameters for high frequency bands.

By applying a nonlinear operation to the speech signal, the increased impact on higher harmonics of a changing
fundamental frequency is reduced or eliminated, and higher harmonics perform better in estimation of the fundamental
frequency and determination of voiced/unvoiced parameters. Suitable nonlinear operations map from complex (or real)
to real values and produce outputs that are nondecreasing functions of the magnitudes of the complex (or real) values.
Such operations include, for example, the absolute value, the absolute value squared, the absolute value raised to
some other power, or the log of the absolute value.

Nonlinear operations tend to produce output signals having spectral peaks at the fundamental frequencies of their
input signals. This is true even when an input signal does not have a spectral peak at the fundamental frequency. For
example, if a bandpass filter that only passes frequencies in the range between the third and fifth harmonics of
$\omega_0$ is applied to a speech signal s(n), the output of the bandpass filter, x(n), will have spectral peaks at
$3\omega_0$, $4\omega_0$ and $5\omega_0$. Though x(n) does not have a spectral peak at $\omega_0$, $|x(n)|^2$ will have
such a peak. For a real signal x(n), $|x(n)|^2$ is equivalent to $x^2(n)$. As is well known, the Fourier transform of
$x^2(n)$ is the convolution of $X(\omega)$, the Fourier transform of x(n), with $X(\omega)$:



$$\sum_{n=-\infty}^{\infty} x^2(n)\, e^{-j\omega n} = \frac{1}{2\pi} \int_{u=-\pi}^{\pi} X(\omega - u)\, X(u)\, du.$$

The convolution of $X(\omega)$ with $X(\omega)$ has spectral peaks at frequencies equal to the differences between the
frequencies for which $X(\omega)$ has spectral peaks. The differences between the spectral peaks of a periodic signal
are the fundamental frequency and its multiples. Thus, in the example in which $X(\omega)$ has spectral peaks at
$3\omega_0$, $4\omega_0$ and $5\omega_0$, $X(\omega)$ convolved with $X(\omega)$ has a spectral peak at $\omega_0$
($4\omega_0 - 3\omega_0$, $5\omega_0 - 4\omega_0$). For a typical periodic signal, the spectral peak at the
fundamental frequency is likely to be the most prominent.
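The effect described above can be checked numerically. The following is a minimal sketch, not part of the described embodiment: it builds a real signal containing only the third, fourth and fifth harmonics and compares the spectrum of the signal with the spectrum of its square. The segment length `N` and the bin `k0` at which the fundamental is placed are arbitrary illustrative choices.

```python
import numpy as np

N = 1024                  # samples in the analysis segment (illustrative choice)
k0 = 16                   # DFT bin of the fundamental (illustrative choice)
n = np.arange(N)
w0 = 2 * np.pi * k0 / N   # fundamental frequency in radians per sample

# Band-limited signal: only the 3rd, 4th and 5th harmonics are present.
x = np.cos(3 * w0 * n) + np.cos(4 * w0 * n) + np.cos(5 * w0 * n)

mag_x = np.abs(np.fft.rfft(x))        # spectrum of the bandpass signal
mag_x2 = np.abs(np.fft.rfft(x ** 2))  # spectrum after the squaring nonlinearity

# x has essentially no energy at the fundamental bin, but x**2 does,
# because harmonics spaced by w0 produce a difference term at w0.
print(mag_x[k0], mag_x2[k0])
```

The squared signal also contains sum-frequency terms (for example at $7\omega_0$ and $9\omega_0$); the sketch only inspects the bin at the fundamental.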

The above discussion also applies to complex signals. For a complex signal x(n), the Fourier transform of $|x(n)|^2$ is:

$$\sum_{n=-\infty}^{\infty} |x(n)|^2\, e^{-j\omega n} = \frac{1}{2\pi} \int_{u=-\pi}^{\pi} X(\omega + u)\, X^*(u)\, du.$$

This is an autocorrelation of $X(\omega)$ with $X^*(\omega)$, and also has the property that spectral peaks separated by
$n\omega_0$ produce peaks at $n\omega_0$.

Even though $|x(n)|$, $|x(n)|^a$ for some real "a", and $\log|x(n)|$ are not the same as $|x(n)|^2$, the discussion above
for $|x(n)|^2$ applies approximately at the qualitative level. For example, for $|x(n)| = y(n)^{0.5}$, where
$y(n) = |x(n)|^2$, a Taylor series expansion of y(n) can be expressed as:

$$|x(n)| = \sum_{k=0}^{\infty} c_k\, y^k(n).$$

Because multiplication is associative, the Fourier transform of the signal $y^k(n)$ is $Y(\omega)$ convolved with the
Fourier transform of $y^{k-1}(n)$. The behavior for nonlinear operations other than $|x(n)|^2$ can be derived from
$|x(n)|^2$ by observing the behavior of multiple convolutions of $Y(\omega)$ with itself. If $Y(\omega)$ has peaks at
$n\omega_0$, then multiple convolutions of $Y(\omega)$ with itself will also have peaks at $n\omega_0$.

As shown, nonlinear operations emphasize the fundamental frequency of a periodic signal, and are particularly
useful when the periodic signal includes significant energy at higher harmonics. However, the presence of the
nonlinearity can degrade performance in some cases. For example, performance may be degraded when speech signal
s(n) is divided into multiple bands $s^i(n)$ using bandpass filters, where $s^i(n)$ denotes the result of bandpass
filtering using the ith bandpass filter. If a single harmonic of the fundamental frequency is present in the pass band
of the ith filter, the output of the filter is:

$$s^i(n) = A_k\, e^{j(\omega_k n + \theta_k)}$$

where $\omega_k$ is the frequency, $\theta_k$ is the phase, and $A_k$ is the amplitude of the harmonic. When a
nonlinearity such as the absolute value is applied to $s^i(n)$ to produce a value $y^i(n)$, the result is:

$$y^i(n) = |s^i(n)| = |A_k|$$

so that the frequency information has been completely removed from the signal $y^i(n)$. Removal of this frequency
information can reduce the accuracy of parameter estimates.
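The loss of frequency information can be seen in a few lines. This is only an illustrative sketch; the amplitude, frequency and phase values are arbitrary and are not taken from the description.

```python
import numpy as np

n = np.arange(256)
A_k, w_k, theta_k = 0.8, 0.3, 1.1   # arbitrary amplitude, frequency, phase

# Band output containing a single complex harmonic.
s_i = A_k * np.exp(1j * (w_k * n + theta_k))

# The absolute-value nonlinearity yields a constant: w_k is no longer observable.
y_i = np.abs(s_i)
spread = y_i.max() - y_i.min()
print(spread)
```

Any estimator that looks only at `y_i` sees a constant, which is why a second estimate taken before the nonlinearity is useful in this case.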

Our hybrid technique provides significantly improved parameter estimation performance in cases for which the
nonlinearity reduces the accuracy of parameter estimates while maintaining the benefits of the nonlinearity in the
remaining cases. As described above, the hybrid technique includes combining parameter estimates based on the signal
after the nonlinearity has been applied ($y^i(n)$) with parameter estimates based on the signal before the nonlinearity is
applied ($s^i(n)$ or s(n)). The two approaches produce parameter estimates along with an indication of the probability of
correctness of these parameter estimates. The parameter estimates are then combined giving higher weight to
estimates with a higher probability of being correct.

In another aspect, generally, the invention features the application of smoothing techniques to the voiced/unvoiced
parameters. Voiced/unvoiced parameters can be binary or continuous functions of time and/or frequency. Because
these parameters tend to be smooth functions in at least one direction (positive or negative) of time or frequency, the
estimates of these parameters can benefit from appropriate application of smoothing techniques in time and/or
frequency.

The invention also features an improved technique for estimating voiced/unvoiced parameters. In vocoders such
as linear prediction vocoders, homomorphic vocoders, channel vocoders, sinusoidal transform coders, multiband
excitation vocoders, and IMBE (TM) vocoders, a pitch period n (or equivalently a fundamental frequency) is selected.
Thereafter, a function $f^i(n)$ is evaluated at the selected pitch period (or fundamental frequency) to estimate the ith
voiced/unvoiced parameter. However, for some speech signals, evaluation of this function only at the selected pitch
period will result in reduced accuracy of one or more voiced/unvoiced parameter estimates. This reduced accuracy
may result from speech signals that are more periodic at a multiple of the pitch period than at the pitch period, and
may be frequency dependent so that only certain portions of the spectrum are more periodic at a multiple of the pitch
period. Consequently, the voiced/unvoiced parameter estimation accuracy can be improved by evaluating the function
$f^i(n)$ at the pitch period n and at its multiples, and thereafter combining the results of these evaluations.

In another aspect, the invention features an improved technique for estimating the fundamental frequency or pitch
period. When the fundamental frequency $\omega_0$ (or pitch period $n_0$) is estimated, there may be some ambiguity as to
whether $\omega_0$ or a submultiple or multiple of $\omega_0$ is the best choice for the fundamental frequency. Since the
fundamental frequency tends to be a smooth function of time for voiced speech, predictions of the fundamental
frequency based on past estimates can be used to resolve ambiguities and improve the fundamental frequency estimate.

Other features and advantages of the invention will be apparent from the following description of the preferred 
embodiments, given by way of example only. 

In the drawings:

Fig. 1 is a block diagram of a system for determining whether frequency bands of a signal are voiced or unvoiced.

Fig. 2 is a block diagram of a parameter estimation unit of the system of Fig. 1.

Fig. 3 is a block diagram of a channel processing unit of the parameter estimation unit of Fig. 2.

Fig. 4 is a block diagram of a parameter estimation unit of the system of Fig. 1.

Fig. 5 is a block diagram of a channel processing unit of the parameter estimation unit of Fig. 4.

Fig. 6 is a block diagram of a parameter estimation unit of the system of Fig. 1.

Fig. 7 is a block diagram of a channel processing unit of the parameter estimation unit of Fig. 6.

Figs. 8-10 are block diagrams of systems for determining the fundamental frequency of a signal.

Fig. 11 is a block diagram of a voiced/unvoiced parameter smoothing unit.

Fig. 12 is a block diagram of a voiced/unvoiced parameter improvement unit.

Fig. 13 is a block diagram of a fundamental frequency improvement unit.

Figs. 1-12 show the structure of a system for estimating excitation parameters, the various blocks and units of 
which are preferably implemented with software. 

With reference to Fig. 1, a voiced/unvoiced determination system 10 includes a sampling unit 12 that samples an
analog speech signal s(t) to produce a speech signal s(n). For typical speech coding applications, the sampling rate
ranges between six kilohertz and ten kilohertz.

Speech signal s(n) is supplied to a first parameter estimator 14 that divides the speech signal into K+1 bands and
produces a first set of preliminary voiced/unvoiced ("V/UV") parameters ($A^0$ to $A^K$) corresponding to a first estimate as
to whether the signals in the bands are voiced or unvoiced. Speech signal s(n) is also supplied to a second parameter
estimator 16 that produces a second set of preliminary V/UV parameters ($B^0$ to $B^K$) that correspond to a second estimate
as to whether the signals in the bands are voiced or unvoiced. The two sets of preliminary V/UV parameters are
combined by a combination block 18 to produce a set of V/UV parameters ($V^0$ to $V^K$).

With reference to Fig. 2, first parameter estimator 14 produces the first voiced/unvoiced estimate using a frequency
domain approach. Channel processing units 20 in first parameter estimator 14 divide speech signal s(n) into at least
two frequency bands and process the frequency bands to produce a first set of frequency band signals, designated
as $T^0(\omega)$ .. $T^I(\omega)$. As discussed below, channel processing units 20 are differentiated by the parameters of a
bandpass filter used in the first stage of each channel processing unit 20. In the described embodiment, there are
sixteen channel processing units (I equals 15).

A remap unit 22 transforms the first set of frequency band signals to produce a second set of frequency band
signals, designated as $U^0(\omega)$ .. $U^K(\omega)$. In the described embodiment, there are eight frequency band signals in the
second set of frequency band signals (K equals 7). Thus, remap unit 22 maps the frequency band signals from the
sixteen channel processing units 20 into eight frequency band signals. Remap unit 22 does so by combining consecutive
pairs of frequency band signals from the first set into single frequency band signals in the second set. For example,
$T^0(\omega)$ and $T^1(\omega)$ are combined to produce $U^0(\omega)$, and $T^{14}(\omega)$ and $T^{15}(\omega)$ are combined to produce
$U^7(\omega)$. Other approaches to remapping could also be used.
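The pairwise remapping can be sketched as follows. The description says only that consecutive pairs are "combined"; this sketch assumes the combination is a simple sum, and the function name `remap_pairs` is hypothetical.

```python
import numpy as np

def remap_pairs(T):
    """Combine consecutive pairs of band signals.

    T has shape (number_of_bands, number_of_frequency_samples); row 2k and
    row 2k+1 are summed to give output row k. Summation is an assumption,
    since the text does not state how the pair is combined.
    """
    T = np.asarray(T)
    if T.shape[0] % 2 != 0:
        raise ValueError("expected an even number of band signals")
    return T[0::2] + T[1::2]

# Sixteen band signals (I = 15) become eight (K = 7).
T = np.arange(16 * 4, dtype=float).reshape(16, 4)
U = remap_pairs(T)
print(U.shape)
```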

Next, voiced/unvoiced parameter estimation units 24, each associated with a frequency band signal from the
second set, produce preliminary V/UV parameters $A^0$ to $A^K$ by computing a ratio of the voiced energy in the frequency
band at an estimated fundamental frequency $\omega_0$ to the total energy in the frequency band and subtracting this ratio
from 1:

$$A^k = 1.0 - E_v^k(\omega_0) / E_t^k.$$
The voiced energy in the frequency band is computed as:



$$E_v^k(\omega_0) = \sum_{n=1}^{N} \sum_{\omega_m \in I_n} U^k(\omega_m)$$




where

$$I_n = [(n-0.25)\,\omega_0,\ (n+0.25)\,\omega_0],$$

and N is the number of harmonics of the fundamental frequency $\omega_0$ being considered. V/UV parameter estimation units
24 determine the total energy of their associated frequency band signals as:



The degree to which the frequency band signal is voiced varies inversely with the value of the preliminary V/UV
parameter. Thus, the frequency band signal is highly voiced when the preliminary V/UV parameter is near zero and is
highly unvoiced when the parameter is greater than or equal to one half.
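The computation of a preliminary V/UV parameter from one band signal can be sketched as below. This is a hedged illustration rather than the described implementation: the voiced energy is taken as the sum of the band signal over the harmonic intervals $I_n$, and the total energy is assumed to be the sum of the band signal over all available frequency samples, since the total-energy formula is not legible in this text. The function name `vuv_parameter` and its arguments are hypothetical.

```python
import numpy as np

def vuv_parameter(U_k, omega, omega0, num_harmonics):
    """Return 1 - (voiced energy / total energy) for one band signal.

    U_k           : nonnegative band signal sampled at frequencies `omega`
    omega0        : estimated fundamental frequency (radians per sample)
    num_harmonics : number of harmonics N to include
    """
    U_k = np.asarray(U_k, dtype=float)
    omega = np.asarray(omega, dtype=float)
    voiced = 0.0
    for n in range(1, num_harmonics + 1):
        # Interval I_n = [(n - 0.25) * omega0, (n + 0.25) * omega0]
        in_interval = (omega >= (n - 0.25) * omega0) & (omega <= (n + 0.25) * omega0)
        voiced += U_k[in_interval].sum()
    total = U_k.sum()  # assumed definition of the total energy
    return 1.0 - voiced / total if total > 0 else 1.0

omega = np.linspace(0.0, np.pi, 513)
omega0 = omega[16]

# Energy only at harmonics of omega0: the parameter is zero (highly voiced).
harmonic = np.zeros_like(omega)
harmonic[16 * np.arange(1, 9)] = 1.0
a_voiced = vuv_parameter(harmonic, omega, omega0, 8)

# Flat spectrum: roughly half the energy lies in the harmonic intervals.
a_flat = vuv_parameter(np.ones_like(omega), omega, omega0, 31)
print(a_voiced, a_flat)
```

With a flat spectrum the intervals cover about half of the frequency axis, so the parameter lands near one half, which matches the "highly unvoiced" threshold mentioned in the text.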

With reference to Fig. 3, when speech signal s(n) enters a channel processing unit 20, components $s^i(n)$ belonging
to a particular frequency band are isolated by a bandpass filter 26. Bandpass filter 26 uses downsampling to reduce
computational requirements, and does so without any significant impact on system performance. Bandpass filter 26
can be implemented as a Finite Impulse Response (FIR) or Infinite Impulse Response (IIR) filter, or by using an FFT.
In the described embodiment, bandpass filter 26 is implemented using a thirty-two point real input FFT to compute the
outputs of a thirty-two point FIR filter at seventeen frequencies, and achieves a downsampling factor of S by shifting
the input by S samples each time the FFT is computed. For example, if a first FFT used samples one through
thirty-two, a downsampling factor of ten would be achieved by using samples eleven through forty-two in a second FFT.
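The shifted-FFT filter bank can be sketched as follows. This is an illustrative sketch under stated assumptions: the FIR prototype is taken to be a Hamming window, which the description does not specify, and the function name `fft_filterbank` is hypothetical.

```python
import numpy as np

def fft_filterbank(s, shift, size=32, prototype=None):
    """Compute band outputs with a sliding real-input FFT.

    Each FFT covers `size` input samples and the input is advanced by
    `shift` samples between FFTs, giving a downsampling factor of `shift`.
    Returns an array of shape (size // 2 + 1, number_of_frames).
    """
    s = np.asarray(s, dtype=float)
    if len(s) < size:
        raise ValueError("input is shorter than one FFT frame")
    if prototype is None:
        prototype = np.hamming(size)  # assumed prototype filter
    starts = range(0, len(s) - size + 1, shift)
    frames = np.array([s[k:k + size] * prototype for k in starts])
    return np.fft.rfft(frames, axis=1).T

# A tone centred on band 4 of a 32-point FFT, analysed with a shift of ten.
n = np.arange(72)
tone = np.cos(2 * np.pi * 4 * n / 32)
bands = fft_filterbank(tone, shift=10)
print(bands.shape)
```

With 72 input samples and a shift of ten, the frames start at samples 0, 10, 20, 30 and 40 (zero-based), which mirrors the "one through thirty-two, eleven through forty-two" example in the text.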

A first nonlinear operation unit 28 then performs a nonlinear operation on the isolated frequency band $s^i(n)$ to
emphasize the fundamental frequency of the isolated frequency band $s^i(n)$. For complex values of $s^i(n)$ (i greater than
zero), the absolute value, $|s^i(n)|$, is used. For the real value of $s^0(n)$, $s^0(n)$ is used if $s^0(n)$ is greater than zero and
zero is used if $s^0(n)$ is less than or equal to zero.
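The rule for the first nonlinear operation can be written compactly. This is a minimal sketch; the function name `first_nonlinearity` is hypothetical.

```python
import numpy as np

def first_nonlinearity(s_i, band_index):
    """Apply the nonlinearity described for nonlinear operation unit 28.

    Band 0 is real-valued and is half-wave rectified (negative samples
    become zero); every other band is complex and its magnitude is taken.
    """
    s_i = np.asarray(s_i)
    if band_index == 0:
        return np.maximum(s_i.real, 0.0)
    return np.abs(s_i)

print(first_nonlinearity([-1.0, 2.0, 0.0], 0))
print(first_nonlinearity([3 + 4j, -1j], 5))
```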

The output of nonlinear operation unit 28 is passed through a lowpass filtering and downsampling unit 30 to reduce 
the data rate and consequently reduce the computational requirements of later components of the system. Lowpass 
filtering and downsampling unit 30 uses an FIR filter computed every other sample for a downsampling factor of two. 

A windowing and FFT unit 32 multiplies the output of lowpass filtering and downsampling unit 30 by a window and
computes a real input FFT, $S^i(\omega)$, of the product. Typically, windowing and FFT unit 32 uses a Hamming window and
a real input FFT.

Finally, a second nonlinear operation unit 34 performs a nonlinear operation on $S^i(\omega)$ to facilitate estimation of
voiced or total energy and to ensure that the outputs of channel processing units 20, $T^i(\omega)$, combine constructively if
used in fundamental frequency estimation. The absolute value squared is used because it makes all components of
$T^i(\omega)$ real and positive.

With reference to Fig. 4, second parameter estimator 16 produces the second preliminary V/UV estimates using
a sinusoid detector/estimator. Channel processing units 36 in second parameter estimator 16 divide speech signal
s(n) into at least two frequency bands and process the frequency bands to produce a first set of signals, designated as
$R^0(l)$ .. $R^I(l)$. Channel processing units 36 are differentiated by the parameters of a bandpass filter used in the first
stage of each channel processing unit 36. In the described embodiment, there are sixteen channel processing units (I
equals 15). The number of channels (value of I) in Fig. 4 does not have to equal the number of channels (value of I)
in Fig. 2.

A remap unit 38 transforms the first set of signals to produce a second set of signals, designated as $S^0(l)$ .. $S^K(l)$.
The remap unit can be an identity system. In the described embodiment, there are eight signals in the second set of
signals (K equals 7). Thus, remap unit 38 maps the signals from the sixteen channel processing units 36 into eight
signals. Remap unit 38 does so by combining consecutive pairs of signals from the first set into single signals in the
second set. For example, $R^0(l)$ and $R^1(l)$ are combined to produce $S^0(l)$, and $R^{14}(l)$ and $R^{15}(l)$ are combined to produce
$S^7(l)$. Other approaches to remapping could also be used.

Next, V/UV parameter estimation units 40, each associated with a signal from the second set, produce preliminary
V/UV parameters $B^0$ to $B^K$ by computing a ratio of the sinusoidal energy in the signal to the total energy in the signal
and subtracting this ratio from 1:

$$B^k = 1.0 - S^k(1) / S^k(0).$$

With reference to Fig. 5, when speech signal s(n) enters a channel processing unit 36, components $s^i(n)$ belonging
to a particular frequency band are isolated by a bandpass filter 26 that operates identically to the bandpass filters of
channel processing units 20 (see Fig. 3). It should be noted that, to reduce computation requirements, the same
bandpass filters may be used in channel processing units 20 and 36, with the outputs of each filter being supplied to a first
nonlinear operation unit 28 of a channel processing unit 20 and a window and correlate unit 42 of a channel processing
unit 36.

A window and correlate unit 42 then produces two correlation values for the isolated frequency band $s^i(n)$. The
first value, $R^i(0)$, provides a measure of the total energy in the frequency band:

$$R^i(0) = \left[ \frac{1}{2} \sum_{n=0}^{N-1} \left( |s^i(n)|^2 + |s^i(n+S)|^2 \right) \right]^2$$

where N is related to the size of the window and typically defines an interval of 20 milliseconds and S is the number
of samples by which the bandpass filter shifts the input speech samples. The second value, $R^i(1)$, provides a measure
of the sinusoidal energy in the frequency band:

$$R^i(1) = \left| \sum_{n=0}^{N-1} s^i(n+S)\, s^{i*}(n) \right|^2.$$
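The two correlation values and the resulting sinusoid-based V/UV measure can be sketched as below. Two assumptions are made and should be read as such: the array passed in holds the downsampled bandpass outputs, so adjacent entries correspond to $s^i(n)$ and $s^i(n+S)$; and the remap stage is treated as an identity, so the ratio is formed directly from one channel. The function names are hypothetical.

```python
import numpy as np

def correlate_band(s):
    """Return (R0, R1) for one band of downsampled filter outputs.

    R0 measures total energy and R1 measures sinusoidal energy, following
    the two correlation formulas in the text. Adjacent array entries are
    assumed to be S input samples apart.
    """
    s = np.asarray(s, dtype=complex)
    a, b = s[:-1], s[1:]                      # s(n) and s(n + S)
    r0 = (0.5 * np.sum(np.abs(a) ** 2 + np.abs(b) ** 2)) ** 2
    r1 = np.abs(np.sum(b * np.conj(a))) ** 2
    return r0, r1

def sinusoid_vuv(s):
    """Return 1 - R1/R0: near 0 for a single sinusoid, near 1 for noise."""
    r0, r1 = correlate_band(s)
    return 1.0 - r1 / r0 if r0 > 0 else 1.0

m = np.arange(40)
b_tone = sinusoid_vuv(np.exp(1j * 0.7 * m))

rng = np.random.default_rng(0)
noise = rng.standard_normal(400) + 1j * rng.standard_normal(400)
b_noise = sinusoid_vuv(noise)
print(b_tone, b_noise)
```

Because the magnitude of the lag product sum never exceeds half the sum of the two energies, the ratio stays between 0 and 1.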

Combination block 18 produces voiced/unvoiced parameters $V^0$ to $V^K$ by selecting the minimum of a preliminary
V/UV parameter from the first set and a function of a preliminary V/UV parameter from the second set. In particular,
combination block 18 produces the voiced/unvoiced parameters as:

$$V^k = \min\left(A^k,\ f_B(B^k)\right)$$

where

$$f_B(B^k) = B^k + \alpha(k)\,\beta(\omega_0),$$

$$\beta(\omega_0) = \begin{cases} 1.0, & \text{when } \omega_0 \geq 2\pi/60.0, \\ 2\pi/(60\,\omega_0), & \text{when } \omega_0 < 2\pi/60.0, \end{cases}$$



and $\alpha(k)$ is an increasing function of k. Because a preliminary V/UV parameter having a value close to zero has a
higher probability of being correct than a preliminary V/UV parameter having a larger value, the selection of the minimum
value results in the selection of the value that is most likely to be correct.
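The combination rule can be sketched as follows. The description states only that $\alpha(k)$ is an increasing function of k, so the linear `alpha` used in the example is a purely hypothetical placeholder, as are the function names.

```python
import math

def beta(omega0):
    """Pitch-dependent factor: 1.0 at or above 2*pi/60, larger below it."""
    limit = 2.0 * math.pi / 60.0
    return 1.0 if omega0 >= limit else 2.0 * math.pi / (60.0 * omega0)

def combine_vuv(A, B, omega0, alpha):
    """Return V^k = min(A^k, B^k + alpha(k) * beta(omega0)) for each band k."""
    return [min(a, b + alpha(k) * beta(omega0))
            for k, (a, b) in enumerate(zip(A, B))]

# Hypothetical increasing alpha(k); the text does not give its form.
V = combine_vuv([0.2, 0.9], [0.5, 0.1], omega0=0.2, alpha=lambda k: 0.1 * k)
print(V)
```

The added term makes the second estimate less likely to win the minimum in higher bands and at low fundamental frequencies.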

With reference to Fig. 6, in another embodiment, a first parameter estimator 14' produces the first preliminary
V/UV estimate using an autocorrelation domain approach. Channel processing units 44 in first parameter estimator 14'
divide speech signal s(n) into at least two frequency bands and process the frequency bands to produce a first set of






frequency band signals, designated as $T^0(l)$ .. $T^K(l)$. There are eight channel processing units (K equals 7) and no
remapping unit is necessary.

Next, voiced/unvoiced (V/UV) parameter estimation units 46, each associated with a channel processing unit 44,
produce preliminary V/UV parameters $A^0$ to $A^K$ by computing a ratio of the voiced energy in the frequency band at an
estimated pitch period $n_0$ to the total energy in the frequency band and subtracting this ratio from 1:

$$A^k = 1.0 - E_v^k(n_0) / E_t^k.$$
The voiced energy in the frequency band is computed as:

$$E_v^k(n_0) = C(n_0)\, T^k(n_0)$$

where



$$C(n_0) = \frac{1}{\sum_{n=0}^{N-1} w(n)\, w(n+n_0)},$$



N is the number of samples in the window and typically has a value of 101, and $C(n_0)$ compensates for the window
roll-off as a function of increasing autocorrelation lag. For non-integer values of $n_0$, the voiced energy at the nearest
three values of n is used with a parabolic interpolation method to obtain the voiced energy for $n_0$. The total energy
is determined as the voiced energy for $n_0$ equal to zero.
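The parabolic interpolation step for a non-integer pitch period can be sketched as below. This is one common way to fit a parabola through three equally spaced points; the description does not spell out the exact method, and the function name `parabolic_interp` is hypothetical.

```python
def parabolic_interp(values, n0):
    """Evaluate, at a fractional lag n0, the parabola through the three
    integer lags nearest to n0.

    `values` is indexable by integer lag; the centre lag is round(n0), and
    both of its neighbours must be valid indices.
    """
    c = int(round(n0))
    y_m, y_c, y_p = values[c - 1], values[c], values[c + 1]
    d = n0 - c  # fractional offset from the centre lag
    return y_c + 0.5 * d * (y_p - y_m) + 0.5 * d * d * (y_p - 2.0 * y_c + y_m)

# A quadratic is reproduced exactly, which makes a convenient check.
energies = [(lag - 10.3) ** 2 for lag in range(20)]
print(parabolic_interp(energies, 7.25))
```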

With reference to Fig. 7, when speech signal s(n) enters a channel processing unit 44, components $s^i(n)$ belonging
to a particular frequency band are isolated by a bandpass filter 48. Bandpass filter 48 uses downsampling to reduce
computational requirements, and does so without any significant impact on system performance. Bandpass filter 48
can be implemented as a Finite Impulse Response (FIR) or Infinite Impulse Response (IIR) filter, or by using an FFT.
A downsampling factor of S is achieved by shifting the input speech samples by S each time the filter outputs are
computed.

A nonlinear operation unit 50 then performs a nonlinear operation on the isolated frequency band $s^i(n)$ to emphasize
the fundamental frequency of the isolated frequency band $s^i(n)$. For complex values of $s^i(n)$ (i greater than zero), the
absolute value, $|s^i(n)|$, is used. For the real value of $s^0(n)$, no nonlinear operation is performed.

The output of nonlinear operation unit 50 is passed through a highpass filter 52, and the output of the highpass 
filter is passed through an autocorrelation unit 54. A 101 point window is used, and, to reduce computation, the auto- 
correlation is only computed at a few samples nearest the pitch period. 
With reference again to Fig. 4, second parameter estimator 16 may also use other approaches to produce the
second voiced/unvoiced estimate. For example, well-known techniques such as using the height of the peak of the
cepstrum, using the height of the peak of the autocorrelation of a linear prediction coder residual, MBE model parameter
estimation methods, or IMBE (TM) model parameter estimation methods may be used. In addition, with reference again
to Fig. 5, window and correlate unit 42 may produce autocorrelation values for the isolated frequency band $s^i(n)$ as:

$$R^i(l) = \mathrm{Re}\left[ \sum_{n} s^i(n+l)\, w(n+l)\, s^{i*}(n)\, w(n) \right]$$

where w(n) is the window. With this approach, combination block 18 produces the voiced/unvoiced parameters as:

V* = m\n{A k , B k ). 

The fundamental frequency may be estimated using a number of approaches. First, with reference to Fig. 8, a
fundamental frequency estimation unit 56 includes a combining unit 58 and an estimator 60. Combining unit 58 sums
the T^i(ω) outputs of channel processing units 20 (Fig. 2) to produce X(ω). In an alternative approach, combining unit
58 could estimate a signal-to-noise ratio (SNR) for the output of each channel processing unit 20 and weigh the various
outputs so that an output with a higher SNR contributes more to X(ω) than does an output with a lower SNR.

Estimator 60 then estimates the fundamental frequency (ω_0) by selecting a value for ω_0 that maximizes X(ω_0) over
an interval from ω_min to ω_max. Since X(ω) is only available at discrete samples of ω, parabolic interpolation of X(ω_0)
near ω_0 is used to improve accuracy of the estimate. Estimator 60 further improves the accuracy of the fundamental
estimate by combining parabolic estimates near the peaks of the N harmonics of ω_0 within the bandwidth of X(ω).
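The peak search with parabolic refinement can be illustrated with a short sketch. It assumes X(ω) is available as a list of samples on a uniform grid. The function name `parabolic_peak` and the three-point fit are illustrative choices, and the combination over harmonics is not shown.

```python
def parabolic_peak(values, lo, hi):
    """Locate the maximum of values[lo..hi] and refine it with a parabola.

    Returns (fractional_index, interpolated_value). The fit uses the largest
    sample and its two neighbours; at the ends of the list, or when the three
    points are collinear, the raw sample is returned unchanged.
    """
    k = max(range(lo, hi + 1), key=lambda i: values[i])
    if k == 0 or k == len(values) - 1:
        return float(k), values[k]
    a, b, c = values[k - 1], values[k], values[k + 1]
    denom = a - 2.0 * b + c
    if denom == 0.0:
        return float(k), b
    offset = 0.5 * (a - c) / denom          # vertex position relative to k
    return k + offset, b - 0.25 * (a - c) * offset


# Samples of an exact parabola whose true maximum lies between grid points
samples = [10.0 - (i - 3.3) ** 2 for i in range(8)]
position, height = parabolic_peak(samples, 0, 7)
```

With a grid spacing of Δω, the refined index converts to a frequency estimate as position times Δω.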

Once an estimate of the fundamental frequency is determined, the voiced energy E_v(ω_0) is computed as:

$$E_v(\omega_0) = \sum_{n=1}^{N} \sum_{\omega_m \in I_n} X(\omega_m)$$

where

$$I_n = [(n-0.25)\,\omega_0,\ (n+0.25)\,\omega_0].$$

Thereafter, the voiced energy E_v(0.5ω_0) is computed and compared to E_v(ω_0) to select between ω_0 and 0.5ω_0 as the
final estimate of the fundamental frequency.
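The voiced-energy sum over the harmonic intervals can be sketched as below, assuming X(ω) is sampled at ω_m = m·Δω. The function name `voiced_energy`, the argument names, and the choice of N as the number of harmonics that fit in the given bandwidth are assumptions. The rule for choosing between ω_0 and 0.5ω_0 is not specified in the text, so it is not implemented here.

```python
import math


def voiced_energy(spectrum, d_omega, omega0, bandwidth):
    """Sum spectrum samples lying within +/-0.25*omega0 of each harmonic.

    spectrum  : samples X(m * d_omega), m = 0, 1, ...
    d_omega   : frequency spacing of the samples
    omega0    : candidate fundamental frequency
    bandwidth : highest frequency covered by the harmonics
    """
    n_harmonics = int(bandwidth / omega0)
    total = 0.0
    for n in range(1, n_harmonics + 1):
        first = math.ceil((n - 0.25) * omega0 / d_omega)
        last = math.floor((n + 0.25) * omega0 / d_omega)
        for m in range(first, min(last, len(spectrum) - 1) + 1):
            total += spectrum[m]
    return total


# Toy spectrum with unit peaks at the harmonics of omega0 = 8 (arbitrary units)
spectrum = [0.0] * 64
for m in (8, 16, 24):
    spectrum[m] = 1.0
energy_at_w0 = voiced_energy(spectrum, 1.0, 8.0, 32.0)
energy_at_2w0 = voiced_energy(spectrum, 1.0, 16.0, 32.0)
```

In this toy case a candidate at twice the true fundamental captures only one of the three peaks, so its voiced energy is lower.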

With reference to Fig. 9, an alternative fundamental frequency estimation unit 62 includes a nonlinear operation
unit 64, a windowing and Fast Fourier Transform (FFT) unit 66, and an estimator 68. Nonlinear operation unit 64
performs a nonlinear operation, the absolute value squared, on s(n) to emphasize the fundamental frequency of s(n)
and to facilitate determination of the voiced energy when estimating ω_0.

Windowing and FFT unit 66 multiplies the output of nonlinear operation unit 64 by a window to segment it and
computes an FFT, X(ω), of the resulting product. Finally, estimator 68, which works identically to estimator 60,
generates an estimate of the fundamental frequency.

With reference to Fig. 10, a hybrid fundamental frequency estimation unit 70 includes a band combination and
estimation unit 72, an IMBE estimation unit 74 and an estimate combination unit 76. Band combination and estimation
unit 72 combines the outputs of channel processing units 20 (Fig. 2) using simple summation or a signal-to-noise ratio
(SNR) weighting where bands with higher SNRs are given higher weight in the combination. From the combined signal
(U(ω)), unit 72 estimates a fundamental frequency and a probability that the fundamental frequency is correct. Unit 72
estimates the fundamental frequency by choosing the frequency that maximizes the voiced energy (E_v(ω_0)) from the
combined signal, which is determined as:

$$E_v(\omega_0) = \sum_{n=1}^{N} \sum_{\omega_m \in I_n} U(\omega_m)$$

where

$$I_n = [(n-0.25)\,\omega_0,\ (n+0.25)\,\omega_0]$$

and N is the number of harmonics of the fundamental frequency. The probability that ω_0 is correct is estimated by
comparing E_v(ω_0) to the total energy E_t, which is computed as:

$$E_t = \sum_{\omega_m > 0.5\,\omega_0} U(\omega_m).$$

When E_v(ω_0) is close to E_t, the probability estimate is near one. When E_v(ω_0) is close to one half of E_t, the probability
estimate is near zero.

IMBE estimation unit 74 uses the well known IMBE technique, or a similar technique, to produce a second
fundamental frequency estimate and probability of correctness. Thereafter, estimate combination unit 76 combines the two
fundamental frequency estimates to produce the final fundamental frequency estimate. The probabilities of correctness
are used so that the estimate with higher probability of correctness is selected or given the most weight.
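One way to realise the probability estimate and the combination step is sketched below. The text fixes only the two end points of the probability (near one when the voiced energy is close to the total energy, near zero when it is close to half of it). The linear mapping between those points, the function names, and the probability-weighted average are assumptions made for illustration.

```python
def probability_correct(e_voiced, e_total):
    """Map the voiced/total energy ratio onto [0, 1].

    A ratio of 1.0 maps to 1 and a ratio of 0.5 maps to 0. The straight line
    between those points is an assumed choice, not a rule given in the text.
    """
    if e_total <= 0.0:
        return 0.0
    return min(1.0, max(0.0, 2.0 * e_voiced / e_total - 1.0))


def combine_estimates(omega_a, p_a, omega_b, p_b, select=True):
    """Combine two fundamental frequency estimates using their probabilities.

    select=True  : return the estimate with the higher probability
    select=False : return a probability-weighted average
    """
    if select:
        return omega_a if p_a >= p_b else omega_b
    if p_a + p_b == 0.0:
        return 0.5 * (omega_a + omega_b)
    return (p_a * omega_a + p_b * omega_b) / (p_a + p_b)


p_band = probability_correct(7.5, 10.0)
chosen = combine_estimates(0.10, 0.9, 0.20, 0.3)
blended = combine_estimates(0.10, 0.75, 0.20, 0.25, select=False)
```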

With reference to Fig. 11, a voiced/unvoiced parameter smoothing unit 78 performs a smoothing operation to
remove voicing errors that might result from rapid transitions in the speech signal. Unit 78 produces a smoothed
voiced/unvoiced parameter as:

$$v_s^k(n) = \begin{cases} 1.0, & \text{when } v^k(n-1)\,v^k(n+1) = 1 \\ v^k(n), & \text{otherwise} \end{cases}$$

where the voiced/unvoiced parameters equal zero for unvoiced speech and one for voiced speech. When the
voiced/unvoiced parameters have continuous values, with a value near zero corresponding to highly voiced speech, unit 78
produces a smoothed voiced/unvoiced parameter that is smoothed in both the time and frequency domains:

$$v_s^k(n) = \lambda^k(n)\,\min\bigl(v^k(n),\ \alpha^k(n),\ \beta^k(n),\ \gamma^k(n)\bigr)$$

where

$$\alpha^k(n) = \begin{cases} 2\,v^{k+1}(n), & \text{when } k = 0, 1, \ldots, K-1 \\ \infty, & \text{when } k = K \end{cases}$$

$$\beta^k(n) = \begin{cases} 2\,v^{k-1}(n), & \text{when } k = 2, 3, \ldots, K \\ \infty, & \text{when } k = 0, 1 \end{cases}$$

$$\gamma^k(n) = \begin{cases} 0.25\,v^{k-1}(n) + 0.5\,v^k(n) + 0.25\,v^{k+1}(n), & \text{when } k = 1, 2, \ldots, K-1 \\ \infty, & \text{when } k = 0, K \end{cases}$$

$$\lambda^k(n) = \begin{cases} 0.8, & \text{when } v^k(n-1) < T^k(n-1) \text{ and } |\omega_0(n) - \omega_0(n-1)| < 0.25\,|\omega_0(n)| \\ 1, & \text{otherwise} \end{cases}$$

and T^k(n) is a threshold value that is a function of time and frequency.
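The binary case of the smoothing rule (zero for unvoiced, one for voiced) can be sketched as follows. The frame-by-band list layout and the name `smooth_binary_vuv` are assumptions. The first and last frames are left unchanged because they lack a neighbour on one side, and the continuous-valued case is not covered.

```python
def smooth_binary_vuv(decisions):
    """Smooth binary voiced/unvoiced decisions along the time axis.

    decisions[n][k] is 1 (voiced) or 0 (unvoiced) for frame n and band k.
    A band is forced to voiced when the same band is voiced in both the
    previous and the next frame; otherwise its decision is kept.
    """
    smoothed = [row[:] for row in decisions]
    for n in range(1, len(decisions) - 1):
        for k in range(len(decisions[n])):
            if decisions[n - 1][k] == 1 and decisions[n + 1][k] == 1:
                smoothed[n][k] = 1
    return smoothed


# Three frames, two bands: the isolated unvoiced decision in band 0 is removed
frames = [[1, 0], [0, 0], [1, 1]]
result = smooth_binary_vuv(frames)
```

Note that the rule needs the decision for frame n+1, so the smoothed output for a frame is available only after the following frame has been analysed.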

With reference to Fig. 12, a voiced/unvoiced parameter improvement unit 80 produces improved voiced/unvoiced
parameters by comparing the voiced/unvoiced parameter produced when the estimated fundamental frequency equals
ω_0 to a voiced/unvoiced parameter produced when the estimated fundamental frequency equals one half of ω_0 and
selecting the parameter having the lowest value. In particular, voiced/unvoiced parameter improvement unit 80
produces improved voiced/unvoiced parameters as:

$$A^k(\omega_0) = \min\bigl(\hat{A}^k(\omega_0),\ \hat{A}^k(0.5\,\omega_0)\bigr)$$

where

$$\hat{A}^k(\omega) = 1.0 - \ldots$$
With reference to Fig. 13, an improved estimate of the fundamental frequency (ω_0) is generated according to a
procedure 100. The initial fundamental frequency estimate (ω̂_0) is generated according to one of the procedures
described above and is used in step 101 to generate a set of evaluation frequencies ω̃_k. The evaluation frequencies
are typically chosen to be near the integer submultiples and multiples of ω̂_0. Thereafter, functions are evaluated at
this set of evaluation frequencies (step 102). The functions that are evaluated typically consist of the voiced energy
function E_v(ω̃_k) and the normalized frame error E_f(ω̃_k). The normalized frame error is computed as

$$E_f(\tilde{\omega}_k) = 1.0 - E_v(\tilde{\omega}_k) / E_t(\tilde{\omega}_k).$$

The final fundamental frequency estimate is then selected (step 103) using the evaluation frequencies, the function
values at the evaluation frequencies, the predicted fundamental frequency (described below), the final fundamental
frequency estimates from previous frames, and the above function values from previous frames. When these inputs
indicate that one evaluation frequency has a much higher probability of being the correct fundamental frequency than
the others, then it is chosen. Otherwise, if two evaluation frequencies have similar probability of being correct and the
normalized error for the previous frame is relatively low, then the evaluation frequency closest to the final fundamental
frequency from the previous frame is chosen. Otherwise, if two evaluation frequencies have similar probability of being
correct, then the one closest to the predicted fundamental frequency is chosen. The predicted fundamental frequency
for the next frame is generated (step 104) using the final fundamental frequency estimates from the current and previous
frames, a delta fundamental frequency, and normalized frame errors computed at the final fundamental frequency
estimate for the current frame and previous frames. The delta fundamental frequency is computed from the frame to
frame difference in the final fundamental frequency estimate when the normalized frame errors for these frames are
relatively low and the percentage change in fundamental frequency is low; otherwise, it is computed from previous
values. When the normalized error for the current frame is relatively low, the predicted fundamental for the current
frame is set to the final fundamental frequency. The predicted fundamental for the next frame is set to the sum of the
predicted fundamental for the current frame and the delta fundamental frequency for the current frame.
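The prediction update of step 104 can be sketched as below. The text does not quantify "relatively low", so the two thresholds are hypothetical values. Reusing the previous delta is one possible reading of "computed from previous values", and the function and argument names are illustrative only.

```python
def update_prediction(final_cur, final_prev, err_cur, err_prev,
                      predicted_cur, delta_prev,
                      err_threshold=0.2, change_threshold=0.1):
    """Return (predicted fundamental for the next frame, delta for this frame).

    err_threshold and change_threshold are assumed values standing in for
    "relatively low" frame error and "low" percentage change.
    """
    errors_low = err_cur < err_threshold and err_prev < err_threshold
    change_low = abs(final_cur - final_prev) <= change_threshold * final_prev
    if errors_low and change_low:
        delta = final_cur - final_prev      # trust the frame-to-frame difference
    else:
        delta = delta_prev                  # assumption: keep the previous delta
    if err_cur < err_threshold:
        predicted_cur = final_cur           # reliable frame resets the prediction
    return predicted_cur + delta, delta


# Reliable frames: prediction follows the measured trend
next_good, delta_good = update_prediction(0.105, 0.100, 0.05, 0.05, 0.110, 0.0)
# Unreliable current frame: prediction coasts on earlier values
next_bad, delta_bad = update_prediction(0.200, 0.100, 0.90, 0.05, 0.110, 0.002)
```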

Claims

1. A method of analysing a digitized speech signal to determine excitation parameters for the digitized speech signal,
preferably as a step in encoding speech, the method comprising dividing the digitized speech signal into one or
more frequency band signals; and, preferably at regular intervals of time, performing the further step of: determining
a first preliminary excitation parameter using a first method that includes performing a nonlinear operation on at
least one of the frequency band signals to produce at least one modified frequency band signal and determining
the first preliminary excitation parameter using the at least one modified frequency band signal; determining at
least a second preliminary excitation parameter using at least a second method different from the said first method;
and using the first and at least a second preliminary excitation parameters to determine an excitation parameter
for the digitized speech signal.

2. A method according to Claim 1 , wherein at least one of the second methods uses at least one of the frequency 
band signals without performing the said nonlinear operation. 

3. A method according to Claims 1 or 2, wherein the excitation parameter comprises a voiced/unvoiced parameter
for at least one frequency band, said parameter preferably having values that vary over a continuous range. 

4. A method according to any preceding claim, further comprising determining a fundamental frequency for the dig- 
itized speech signal. 



5. A method according to Claim 3, wherein the first preliminary excitation parameter comprises a first voiced/unvoiced
parameter for the at least one modified frequency band signal, and wherein the first determining step includes
determining the first voiced/unvoiced parameter by comparing voiced energy in the modified frequency band signal
to total energy in the modified frequency band signal.

6. A method according to Claim 5, wherein the voiced energy in the modified frequency band signal corresponds to
the energy associated with an estimated fundamental frequency for the digitized speech signal.

7. A method according to Claim 5, wherein the voiced energy in the modified frequency band signal corresponds to
the energy associated with an estimated pitch period for the digitized speech signal.

8. A method according to Claim 5, wherein the second preliminary excitation parameter includes a second voiced/ 
unvoiced parameter for the at least one frequency band signal, and wherein the second determining step includes 
determining the second voiced/unvoiced parameter by comparing sinusoidal energy in the at least one frequency 
band signal to total energy in the at least one frequency band signal. 

9. A method according to Claim 5, wherein the second preliminary excitation parameter includes a second voiced/ 
unvoiced parameter for the at least one frequency band signal, and wherein the second determining step includes 
determining the second voiced/unvoiced parameter by autocorrelating the at least one frequency band signal. 

10. A method according to any preceding claim, wherein the said using step emphasizes the first preliminary excitation
parameter over the second preliminary excitation parameter in determining the excitation parameter for the digitized
speech signal when the first preliminary excitation parameter has a higher probability of being correct than does
the second preliminary excitation parameter.

11. A method according to any preceding claim, further comprising smoothing the excitation parameter to produce a 
smoothed excitation parameter. 

12. A method of analysing a digitized speech signal to determine excitation parameters for the digitized speech signal, 
preferably as a step in encoding speech, the method comprising the steps of: determining preliminary excitation 
parameters from the digitized speech signal; and smoothing the preliminary excitation parameters to produce 
excitation parameters. 

13. A method according to Claim 12, wherein the preliminary excitation parameters include a preliminary voiced/un- 
voiced parameter for at least one frequency band and the excitation parameters include a voiced/unvoiced pa- 
rameter for at least one frequency band, which voiced/unvoiced parameter preferably has values that vary over a 
continuous range. 

14. A method according to Claim 13, wherein the excitation parameters include a fundamental frequency. 

15. A method according to Claims 13 or 14, wherein the smoothing step makes the voiced/unvoiced parameter more 
voiced than the preliminary voiced/unvoiced parameter when voiced/unvoiced parameters that are nearby in time 
and/or frequency are voiced. 

16. A method according to Claim 12, wherein the smoothing step is performed as a function of time and/or frequency.

17. A method of analysing a digitized speech signal to determine excitation parameters for the digitized speech signal, 
preferably as a step in encoding speech, the method comprising the steps of: estimating a fundamental frequency 
for the digitized speech signal; evaluating a voiced/unvoiced function using the estimated fundamental frequency 
to produce a first preliminary voiced/unvoiced parameter; evaluating the voiced/unvoiced function at least using 
one other frequency derived from the estimated fundamental frequency to produce at least one other preliminary 
voiced/unvoiced parameter; and combining the first and at least one other preliminary voiced/unvoiced parameters 
to produce a voiced/unvoiced parameter. 

18. A method according to Claim 17, wherein the said at least one other frequency is derived from the said estimated
fundamental frequency as a multiple or submultiple of the said estimated fundamental frequency.

19. A method according to Claim 17, wherein the combining step includes choosing the first preliminary
voiced/unvoiced parameter as the voiced/unvoiced parameter when the first preliminary voiced/unvoiced parameter
indicates that the digitized speech signal is more voiced than does the second preliminary voiced/unvoiced parameter.

20. A method of synthesizing speech using excitation parameters, where the excitation parameters are estimated by 
using a method for determining such parameters according to any preceding claim. 

21. A method of analysing a digitized speech signal to determine a fundamental frequency estimate for the digitized
speech signal, comprising the steps of: determining a predicted fundamental frequency estimate from previous
fundamental frequency estimates; determining an initial fundamental frequency estimate; evaluating an error
function at the initial fundamental frequency estimate to produce a first error function value; evaluating the error function
at at least one other frequency derived from the initial fundamental frequency estimate to produce at least one
other error function value; selecting a fundamental frequency estimate using the predicted fundamental frequency
estimate, the initial fundamental frequency estimate, the first error function value, and the at least one other error
function value.

22. A method according to Claim 21, wherein the said at least one other frequency is derived from the said estimated
fundamental frequency as a multiple or submultiple of the said estimated fundamental frequency.

23. A method according to Claim 21, wherein the predicted fundamental frequency is determined by adding a delta 
factor to a previous predicted fundamental frequency, which delta factor is preferably determined from previous 
first and at least one other error function values, the previous predicted fundamental frequency, and a previous 
delta factor. 

24. A method of synthesizing speech using a fundamental frequency, where the fundamental frequency is estimated
using a method according to any of Claims 21, 22 or 23.

25. A system for analysing a digitized speech signal to determine excitation parameters for the digitized speech signal,
comprising: means for dividing the digitized speech signal into one or more frequency band signals; means for
determining a first preliminary excitation parameter using a first method that includes performing a nonlinear
operation on at least one of the frequency band signals to produce at least one modified frequency band signal and
determining the first preliminary excitation parameter using the at least one modified frequency band signal; means
for determining a second preliminary excitation parameter using a second method that is different from the above
said first method; and means for using the first and second preliminary excitation parameters to determine an
excitation parameter for the digitized speech signal.

26. A system for analysing a digitized speech signal to determine excitation parameters for the digitized speech signal,
comprising: means for determining preliminary excitation parameters from the digitized speech signal; and means
for smoothing the preliminary excitation parameters to produce excitation parameters.

27. A system for analysing a digitized speech signal to determine modified excitation parameters for the digitized
speech signal, comprising: means for estimating a fundamental frequency for the digitized speech signal; means
for evaluating a voiced/unvoiced function using the estimated fundamental frequency to produce a first preliminary
voiced/unvoiced parameter; means for evaluating the voiced/unvoiced function using another frequency derived
from the estimated fundamental frequency to produce a second preliminary voiced/unvoiced parameter; and
means for combining the first and second preliminary voiced/unvoiced parameters to produce a voiced/unvoiced
parameter.

28. A system for analysing a digitized speech signal to determine a fundamental frequency estimate for the digitized
speech signal, comprising: means for determining a predicted fundamental frequency estimate from previous
fundamental frequency estimates; means for determining an initial fundamental frequency estimate; means for
evaluating an error function at the initial fundamental frequency estimate to produce a first error function value; means
for evaluating the error function at at least one other frequency derived from the initial fundamental frequency
estimate to produce a second error function value; and means for selecting a fundamental frequency estimate
using the predicted fundamental frequency estimate, the initial fundamental frequency estimate, the first error
function value, and the second error function value.

29. A method of analysing a digitized speech signal to determine a voiced/unvoiced function for the digitized speech
signal, comprising: dividing the digitized speech signal into at least two frequency band signals; determining a first
preliminary voiced/unvoiced function for at least two of the frequency band signals using a first method; determining
a second preliminary voiced/unvoiced function for at least two of the frequency band signals using a second method
which is different from the above said first method; and using the first and second preliminary excitation parameters
to determine a voiced/unvoiced function for at least two of the frequency band signals.



[FIG. 1: block diagram. Labels: S(t), SAMPLING, s(n), PARAMETER ESTIMATOR A (14), PARAMETER ESTIMATOR B (16).]



Printed by Jouve, 75001 PARIS (FR)

EP 0 722 165 A2



Description 

The invention has arisen from work seeking to improve the accuracy with which excitation parameters are estimated
in speech analysis and synthesis.

Speech analysis and synthesis are widely used in applications such as telecommunications and voice recognition.
A vocoder, which is a type of speech analysis/synthesis system, models speech as the response of a system to
excitation over short time intervals. Examples of vocoder systems include linear prediction vocoders, homomorphic
vocoders, channel vocoders, sinusoidal transform coders ("STC"), multiband excitation ("MBE") vocoders, and improved
multiband excitation ("IMBE (TM)") vocoders.

Vocoders typically synthesize speech based on excitation parameters and system parameters. Typically, an input
signal is segmented using, for example, a Hamming window. Then, for each segment, system parameters and excitation
parameters are determined. System parameters include the spectral envelope or the impulse response of the system.
Excitation parameters include a fundamental frequency (or pitch) and a voiced/unvoiced parameter that indicates
whether the input signal has pitch (or indicates the degree to which the input signal has pitch). In vocoders that divide
the speech into frequency bands, such as IMBE (TM) vocoders, the excitation parameters may also include a
voiced/unvoiced parameter for each frequency band rather than a single voiced/unvoiced parameter. Accurate excitation
parameters are essential for high quality speech synthesis.

When the voiced/unvoiced parameters include only a single voiced/unvoiced decision for the entire frequency
band, the synthesized speech tends to have a "buzzy" quality especially noticeable in regions of speech which contain
mixed voicing or in voiced regions of noisy speech. A number of mixed excitation models have been proposed as
potential solutions to the problem of "buzziness" in vocoders. In these models, periodic and noise-like excitations are
mixed which have either time-invariant or time-varying spectral shapes.

In excitation models having time-invariant spectral shapes, the excitation signal consists of the sum of a periodic
source and a noise source with fixed spectral envelopes. The mixture ratio controls the relative amplitudes of the
periodic and noise sources. Examples of such models include Itakura and Saito, "Analysis Synthesis Telephony Based
upon the Maximum Likelihood Method," Reports of 6th Int. Cong. Acoust., Tokyo, Japan, Paper C-5-5, pp. C17-20,
1968; and Kwon and Goldberg, "An Enhanced LPC Vocoder with No Voiced/Unvoiced Switch," IEEE Trans. on Acoust.,
Speech, and Signal Processing, vol. ASSP-32, no. 4, pp. 851-858, August 1984. In these excitation models a white
noise source is added to a white periodic source. The mixture ratio between these sources is estimated from the height
of the peak of the autocorrelation of the LPC residual.

In excitation models having time-varying spectral shapes, the excitation signal consists of the sum of a periodic
source and a noise source with time varying spectral envelope shapes. Examples of such models include Fujimara,
"An Approximation to Voice Aperiodicity," IEEE Trans. Audio and Electroacoust., pp. 68-72, March 1968; Makhoul et
al., "A Mixed-Source Excitation Model for Speech Compression and Synthesis," IEEE Int. Conf. on Acoust. Sp. & Sig.
Proc., April 1978, pp. 163-166; Kwon and Goldberg, "An Enhanced LPC Vocoder with No Voiced/Unvoiced Switch,"
IEEE Trans. on Acoust., Speech, and Signal Processing, vol. ASSP-32, no. 4, pp. 851-858, August 1984; and Griffin
and Lim, "Multiband Excitation Vocoder," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-36,
pp. 1223-1235, Aug. 1988.

In the excitation model proposed by Fujimara, the excitation spectrum is divided into three fixed frequency bands.
A separate cepstral analysis is performed for each frequency band and a voiced/unvoiced decision for each frequency
band is made based on the height of the cepstrum peak as a measure of periodicity.

In the excitation model proposed by Makhoul et al., the excitation signal consists of the sum of a low-pass periodic
source and a high-pass noise source. The low-pass periodic source is generated by filtering a white pulse source with
a variable cut-off low-pass filter. Similarly, the high-pass noise source was generated by filtering a white noise source
with a variable cut-off high-pass filter. The cut-off frequencies for the two filters are equal and are estimated by choosing
the highest frequency at which the spectrum is periodic. Periodicity of the spectrum is determined by examining the
separation between consecutive peaks and determining whether the separations are the same, within some tolerance
level.

In a second excitation model implemented by Kwon and Goldberg, a pulse source is passed through a variable
gain low-pass filter and added to itself, and a white noise source is passed through a variable gain high-pass filter and
added to itself. The excitation signal is the sum of the resultant pulse and noise sources with the relative amplitudes
controlled by a voiced/unvoiced mixture ratio. The filter gains and voiced/unvoiced mixture ratio are estimated from
the LPC residual signal with the constraint that the spectral envelope of the resultant excitation signal is flat.

In the multiband excitation model proposed by Griffin and Lim, a frequency dependent voiced/unvoiced mixture
function is proposed. This model is restricted to a frequency dependent binary voiced/unvoiced decision for coding
purposes. A further restriction of this model divides the spectrum into a finite number of frequency bands with a binary
voiced/unvoiced decision for each band. The voiced/unvoiced information is estimated by comparing the speech
spectrum to the closest periodic spectrum. When the error is below a threshold, the band is marked voiced; otherwise, the
band is marked unvoiced.

Excitation parameters may also be used in applications, such as speech recognition, where no speech synthesis 
is required. Once again, the accuracy of the excitation parameters directly affects the performance of such a system. 

In one aspect, generally, the invention features a hybrid excitation parameter estimation technique that produces
two sets of excitation parameters for a speech signal using two different approaches and combines the two sets to
produce a single set of excitation parameters. In a first approach, the technique applies a nonlinear operation to the
speech signal to emphasize the fundamental frequency of the speech signal. In a second approach, we use a different
method which may or may not include a nonlinear operation. While the first approach produces highly accurate
excitation parameters under most conditions, the second approach produces more accurate parameters under certain
conditions. By using both approaches and combining the resulting sets of excitation parameters to produce a single
set, our technique produces accurate results under a wider range of conditions than are produced by either of the
approaches individually.

In typical approaches to determining excitation parameters, an analog speech signal s(t) is sampled to produce a
speech signal s(n). Speech signal s(n) is then multiplied by a window w(n) to produce a windowed signal s_w(n) that is
commonly referred to as a speech segment or a speech frame. A Fourier transform is then performed on windowed
signal s_w(n) to produce a frequency spectrum S_w(ω) from which the excitation parameters are determined.

When speech signal s(n) is periodic with a fundamental frequency ω_0 or pitch period n_0 (where n_0 equals 2π/ω_0),
the frequency spectrum of speech signal s(n) should be a line spectrum with energy at ω_0 and harmonics thereof
(integral multiples of ω_0). As expected, S_w(ω) has spectral peaks that are centered around ω_0 and its harmonics.
However, due to the windowing operation, the spectral peaks include some width, where the width depends on the
length and shape of window w(n) and tends to decrease as the length of window w(n) increases. This window-induced
error reduces the accuracy of the excitation parameters. Thus, to decrease the width of the spectral peaks, and to
thereby increase the accuracy of the excitation parameters, the length of window w(n) should be made as long as
possible.

The maximum useful length of window w(n) is limited. Speech signals are not stationary signals, and instead have 
fundamental frequencies that change over time. To obtain meaningful excitation parameters, an analyzed speech seg- 
ment must have a substantially unchanged fundamental frequency. Thus, the length of window w(n) must be short 
enough to ensure that the fundamental frequency will not change significantly within the window. 

In addition to limiting the maximum length of window w(n), a changing fundamental frequency tends to broaden
the spectral peaks. This broadening effect increases with increasing frequency. For example, if the fundamental
frequency changes by Δω_0 during the window, the frequency of the mth harmonic, which has a frequency of mω_0, changes
by mΔω_0 so that the spectral peak corresponding to mω_0 is broadened more than the spectral peak corresponding to
ω_0. This increased broadening of the higher harmonics reduces the effectiveness of higher harmonics in the estimation
of the fundamental frequency and the generation of voiced/unvoiced parameters for high frequency bands.

By applying a nonlinear operation to the speech signal, the increased impact on higher harmonics of a changing 
fundamental frequency is reduced or eliminated, and higher harmonics perform better in estimation of the fundamental 
frequency and determination of voiced/unvoiced parameters. Suitable nonlinear operations map from complex (or real) 
to real values and produce outputs that are nondecreasing functions of the magnitudes of the complex (or real) values. 
Such operations include, for example, the absolute value, the absolute value squared, the absolute value raised to 
some other power, or the log of the absolute value. 

Nonlinear operations tend to produce output signals having spectral peaks at the fundamental frequencies of their
input signals. This is true even when an input signal does not have a spectral peak at the fundamental frequency. For
example, if a bandpass filter that only passes frequencies in the range between the third and fifth harmonics of
$\omega_0$ is applied to a speech signal s(n), the output of the bandpass filter, x(n), will have spectral peaks at
$3\omega_0$, $4\omega_0$ and $5\omega_0$.

Though x(n) does not have a spectral peak at $\omega_0$, $|x(n)|^2$ will have such a peak. For a real signal x(n),
$|x(n)|^2$ is equivalent to $x^2(n)$. As is well known, the Fourier transform of $x^2(n)$ is the convolution of
$X(\omega)$, the Fourier transform of x(n), with $X(\omega)$:

$$\sum_{n=-\infty}^{\infty} x^2(n)\, e^{-j\omega n} = \frac{1}{2\pi} \int_{-\pi}^{\pi} X(\omega - u)\, X(u)\, du$$

The convolution of $X(\omega)$ with $X(\omega)$ has spectral peaks at frequencies equal to the differences between the
frequencies for which $X(\omega)$ has spectral peaks. The differences between the spectral peaks of a periodic signal
are the fundamental frequency and its multiples. Thus, in the example in which $X(\omega)$ has spectral peaks at
$3\omega_0$, $4\omega_0$ and $5\omega_0$, $X(\omega)$ convolved with $X(\omega)$ has a spectral peak at $\omega_0$
($4\omega_0 - 3\omega_0$, $5\omega_0 - 4\omega_0$). For a typical periodic signal, the spectral peak at the
fundamental frequency is likely to be the most prominent.

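The above discussion also applies to complex signals. For a complex signal x(n), the Fourier transform of $|x(n)|^2$ is:

$$\sum_{n=-\infty}^{\infty} |x(n)|^2\, e^{-j\omega n} = \frac{1}{2\pi} \int_{-\pi}^{\pi} X(\omega + u)\, X^*(u)\, du$$

This is an autocorrelation of $X(\omega)$ with $X^*(\omega)$, and also has the property that spectral peaks separated
by $n\omega_0$ produce peaks at $n\omega_0$.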
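
By way of illustration only, the appearance of a peak at the fundamental frequency can be checked numerically. The
following Python sketch assumes a 128-sample real signal whose fundamental falls exactly on DFT bin 4 and which
contains only the third, fourth and fifth harmonics. The names dft_magnitudes, N and FUND_BIN, and the use of a
directly evaluated DFT, are assumptions made for this example and form no part of the described embodiments.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitude of the DFT of a sequence (direct O(N^2) evaluation)."""
    n_samples = len(signal)
    return [
        abs(sum(signal[n] * cmath.exp(-2j * math.pi * k * n / n_samples)
                for n in range(n_samples)))
        for k in range(n_samples)
    ]


N = 128        # number of samples analysed (assumed)
FUND_BIN = 4   # DFT bin of the fundamental frequency (assumed)

# x(n): bandpass output containing only the 3rd, 4th and 5th harmonics.
x = [sum(math.cos(2 * math.pi * m * FUND_BIN * n / N) for m in (3, 4, 5))
     for n in range(N)]
# y(n) = |x(n)|^2, which equals x^2(n) for a real signal.
y = [v * v for v in x]

X_mag = dft_magnitudes(x)
Y_mag = dft_magnitudes(y)

# Largest non-DC component of y(n) below the Nyquist bin.
peak_bin = max(range(1, N // 2), key=lambda k: Y_mag[k])
```

In this sketch x(n) has essentially no energy at bin 4, whereas the largest non-DC component of y(n) lies at bin 4,
that is, at the fundamental frequency.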

Even though $|x(n)|$, $|x(n)|^a$ for some real "a", and $\log|x(n)|$ are not the same as $|x(n)|^2$, the discussion
above for $|x(n)|^2$ applies approximately at the qualitative level. For example, for $|x(n)| = y(n)^{0.5}$, where
$y(n) = |x(n)|^2$, a Taylor series expansion in y(n) can be expressed as:

$$|x(n)| = \sum_{k=0}^{\infty} a_k\, y^k(n)$$

Because multiplication is associative, the Fourier transform of the signal $y^k(n)$ is $Y(\omega)$ convolved with the
Fourier transform of $y^{k-1}(n)$. The behavior for nonlinear operations other than $|x(n)|^2$ can be derived from
$|x(n)|^2$ by observing the behavior of multiple convolutions of $Y(\omega)$ with itself. If $Y(\omega)$ has peaks at
$n\omega_0$, then multiple convolutions of $Y(\omega)$ with itself will also have peaks at $n\omega_0$.

As shown, nonlinear operations emphasize the fundamental frequency of a periodic signal, and are particularly
useful when the periodic signal includes significant energy at higher harmonics. However, the presence of the
nonlinearity can degrade performance in some cases. For example, performance may be degraded when speech signal
s(n) is divided into multiple bands $s^i(n)$ using bandpass filters, where $s^i(n)$ denotes the result of bandpass
filtering using the ith bandpass filter. If a single harmonic of the fundamental frequency is present in the pass
band of the ith filter, the output of the filter is:

$$s^i(n) = A_k\, e^{j(\omega_k n + \theta_k)}$$

where $\omega_k$ is the frequency, $\theta_k$ is the phase, and $A_k$ is the amplitude of the harmonic. When a
nonlinearity such as the absolute value is applied to $s^i(n)$ to produce a value $y^i(n)$, the result is:

$$y^i(n) = |s^i(n)| = |A_k\, e^{j(\omega_k n + \theta_k)}| = |A_k|$$

so that the frequency information has been completely removed from the signal $y^i(n)$. Removal of this frequency
information can reduce the accuracy of parameter estimates.
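
The loss of frequency information in the single-harmonic case can be seen in a short Python sketch. The amplitude,
frequency and phase values below are arbitrary assumptions chosen only for illustration.

```python
import cmath

# Assumed amplitude, frequency (radians/sample) and phase of the single harmonic.
A_K, OMEGA_K, THETA_K = 0.7, 0.3, 1.1

# Band signal containing one complex exponential.
band = [A_K * cmath.exp(1j * (OMEGA_K * n + THETA_K)) for n in range(16)]

# Absolute-value nonlinearity: every output sample equals |A_K|.
magnitude = [abs(v) for v in band]
```

Every element of magnitude equals 0.7 to within rounding, whatever value is assumed for the frequency, so the
frequency of the harmonic cannot be recovered from the output of the nonlinearity.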

Our hybrid technique provides significantly improved parameter estimation performance in cases for which the
nonlinearity reduces the accuracy of parameter estimates while maintaining the benefits of the nonlinearity in the
remaining cases. As described above, the hybrid technique includes combining parameter estimates based on the signal
after the nonlinearity has been applied ($y^i(n)$) with parameter estimates based on the signal before the
nonlinearity is applied ($s^i(n)$ or s(n)). The two approaches produce parameter estimates along with an indication
of the probability of correctness of these parameter estimates. The parameter estimates are then combined giving
higher weight to estimates with a higher probability of being correct.

In another aspect, generally, the invention features the application of smoothing techniques to the voiced/unvoiced
parameters. Voiced/unvoiced parameters can be binary or continuous functions of time and/or frequency. Because
these parameters tend to be smooth functions in at least one direction (positive or negative) of time or frequency,
the estimates of these parameters can benefit from appropriate application of smoothing techniques in time and/or
frequency.

The invention also features an improved technique for estimating voiced/unvoiced parameters. In vocoders such
as linear prediction vocoders, homomorphic vocoders, channel vocoders, sinusoidal transform coders, multiband
excitation vocoders, and IMBE (TM) vocoders, a pitch period n (or equivalently a fundamental frequency) is selected.
Thereafter, a function $f^i(n)$ is evaluated at the selected pitch period (or fundamental frequency) to estimate the
ith voiced/unvoiced parameter. However, for some speech signals, evaluation of this function only at the selected
pitch period will result in reduced accuracy of one or more voiced/unvoiced parameter estimates. This reduced
accuracy may result from speech signals that are more periodic at a multiple of the pitch period than at the pitch
period, and may be frequency dependent so that only certain portions of the spectrum are more periodic at a multiple
of the pitch period. Consequently, the voiced/unvoiced parameter estimation accuracy can be improved by evaluating
the function $f^i(n)$ at the pitch period n and at its multiples, and thereafter combining the results of these
evaluations.

In another aspect, the invention features an improved technique for estimating the fundamental frequency or pitch
period. When the fundamental frequency $\omega_0$ (or pitch period $n_0$) is estimated, there may be some ambiguity
as to whether $\omega_0$ or a submultiple or multiple of $\omega_0$ is the best choice for the fundamental frequency.
Since the fundamental frequency tends to be a smooth function of time for voiced speech, predictions of the
fundamental frequency based on past estimates can be used to resolve ambiguities and improve the fundamental
frequency estimate.

Other features and advantages of the invention will be apparent from the following description of the preferred 
embodiments, given by way of example only. 

In the drawings:

Fig. 1 is a block diagram of a system for determining whether frequency bands of a signal are voiced or unvoiced. 

Fig. 2 is a block diagram of a parameter estimation unit of the system of Fig. 1. 

Fig. 3 is a block diagram of a channel processing unit of the parameter estimation unit of Fig. 2. 

Fig. 4 is a block diagram of a parameter estimation unit of the system of Fig. 1.

Fig. 5 is a block diagram of a channel processing unit of the parameter estimation unit of Fig. 4.

Fig. 6 is a block diagram of a parameter estimation unit of the system of Fig. 1.

Fig. 7 is a block diagram of a channel processing unit of the parameter estimation unit of Fig. 6.

Figs. 8-10 are block diagrams of systems for determining the fundamental frequency of a signal.

Fig. 11 is a block diagram of a voiced/unvoiced parameter smoothing unit.

Fig. 12 is a block diagram of a voiced/unvoiced parameter improvement unit.

Fig. 13 is a block diagram of a fundamental frequency improvement unit.

Figs. 1-12 show the structure of a system for estimating excitation parameters, the various blocks and units of 
which are preferably implemented with software. 

With reference to Fig. 1, a voiced/unvoiced determination system 10 includes a sampling unit 12 that samples an
analog speech signal s(t) to produce a speech signal s(n). For typical speech coding applications, the sampling rate
ranges between six kilohertz and ten kilohertz.

Speech signal s(n) is supplied to a first parameter estimator 14 that divides the speech signal into K+1 bands and
produces a first set of preliminary voiced/unvoiced ("V/UV") parameters ($A^0$ to $A^K$) corresponding to a first
estimate as to whether the signals in the bands are voiced or unvoiced. Speech signal s(n) is also supplied to a
second parameter estimator 16 that produces a second set of preliminary V/UV parameters ($B^0$ to $B^K$) that
correspond to a second estimate as to whether the signals in the bands are voiced or unvoiced. The two sets of
preliminary V/UV parameters are combined by a combination block 18 to produce a set of V/UV parameters ($V^0$ to
$V^K$).

With reference to Fig. 2, first parameter estimator 14 produces the first voiced/unvoiced estimate using a frequency
domain approach. Channel processing units 20 in first parameter estimator 14 divide speech signal s(n) into at least
two frequency bands and process the frequency bands to produce a first set of frequency band signals, designated
as $T^0(\omega)$ .. $T^I(\omega)$. As discussed below, channel processing units 20 are differentiated by the
parameters of a bandpass filter used in the first stage of each channel processing unit 20. In the described
embodiment, there are sixteen channel processing units (I equals 15).

A remap unit 22 transforms the first set of frequency band signals to produce a second set of frequency band
signals, designated as $U^0(\omega)$ .. $U^K(\omega)$. In the described embodiment, there are eight frequency band
signals in the second set of frequency band signals (K equals 7). Thus, remap unit 22 maps the frequency band signals
from the sixteen channel processing units 20 into eight frequency band signals. Remap unit 22 does so by combining
consecutive pairs of frequency band signals from the first set into single frequency band signals in the second set.
For example, $T^0(\omega)$ and $T^1(\omega)$ are combined to produce $U^0(\omega)$, and $T^{14}(\omega)$ and
$T^{15}(\omega)$ are combined to produce $U^7(\omega)$. Other approaches to remapping could also be used.
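
The pairwise remapping can be sketched in Python as follows. The description states only that consecutive pairs are
combined; sample-by-sample addition is assumed here, and the function name remap_pairs is hypothetical.

```python
def remap_pairs(band_signals):
    """Combine consecutive pairs of band signals into single band signals.

    Assumption: "combining" is taken to mean sample-by-sample addition.
    """
    if len(band_signals) % 2 != 0:
        raise ValueError("an even number of band signals is required")
    return [
        [a + b for a, b in zip(band_signals[i], band_signals[i + 1])]
        for i in range(0, len(band_signals), 2)
    ]


# Sixteen toy band signals of two samples each; band i holds the value i.
first_set = [[float(i), float(i)] for i in range(16)]
second_set = remap_pairs(first_set)
```

With sixteen input band signals the sketch returns eight, the first formed from bands 0 and 1 and the last from
bands 14 and 15.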

Next, voiced/unvoiced parameter estimation units 24, each associated with a frequency band signal from the second
set, produce preliminary V/UV parameters $A^0$ to $A^K$ by computing a ratio of the voiced energy in the frequency
band at an estimated fundamental frequency $\omega_0$ to the total energy in the frequency band and subtracting this
ratio from 1:

$$A^k = 1.0 - E_v^k(\omega_0)/E_t^k.$$

The voiced energy in the frequency band is computed as:

$$E_v^k(\omega_0) = \sum_{n=1}^{N} \sum_{\omega_m \in I_n} U^k(\omega_m)$$

where

$$I_n = [(n-0.25)\,\omega_0,\ (n+0.25)\,\omega_0],$$

and N is the number of harmonics of the fundamental frequency $\omega_0$ being considered. V/UV parameter estimation
units 24 determine the total energy of their associated frequency band signals as:



The degree to which the frequency band signal is voiced varies inversely with the value of the preliminary V/UV
parameter. Thus, the frequency band signal is highly voiced when the preliminary V/UV parameter is near zero and is
highly unvoiced when the parameter is greater than or equal to one half.
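
A minimal Python sketch of this computation is given below. It sums the band spectrum over the intervals within one
quarter of the fundamental frequency on either side of each harmonic and forms one minus the ratio to the total
energy. The function names, the frequency grid and the test spectra are assumptions, and the total energy is assumed
here to be the sum of all supplied spectrum samples.

```python
def voiced_energy(spectrum, freqs, w0, num_harmonics):
    """Sum spectrum samples lying within +/-0.25*w0 of each harmonic of w0."""
    energy = 0.0
    for n in range(1, num_harmonics + 1):
        low, high = (n - 0.25) * w0, (n + 0.25) * w0
        energy += sum(u for u, w in zip(spectrum, freqs) if low <= w <= high)
    return energy


def preliminary_vuv(spectrum, freqs, w0, num_harmonics):
    """Return 1.0 minus the ratio of voiced energy to total energy."""
    # Assumption: total energy is the sum of all supplied spectrum samples.
    total = sum(spectrum)
    if total == 0.0:
        return 1.0  # assumed convention for a band with no energy
    return 1.0 - voiced_energy(spectrum, freqs, w0, num_harmonics) / total


freqs = [0.125 * i for i in range(33)]  # 0.0 .. 4.0 on an assumed grid
harmonic = [1.0 if i > 0 and i % 8 == 0 else 0.0 for i in range(33)]
flat = [1.0] * 33

a_harmonic = preliminary_vuv(harmonic, freqs, 1.0, 4)
a_flat = preliminary_vuv(flat, freqs, 1.0, 4)
```

Under these assumptions a band whose energy lies entirely at the harmonics yields a parameter of zero, while a flat
spectrum on this coarse grid yields a value of about 0.45, toward the unvoiced end of the range.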

With reference to Fig. 3, when speech signal s(n) enters a channel processing unit 20, components $s^i(n)$ belonging
to a particular frequency band are isolated by a bandpass filter 26. Bandpass filter 26 uses downsampling to reduce
computational requirements, and does so without any significant impact on system performance. Bandpass filter 26
can be implemented as a Finite Impulse Response (FIR) or Infinite Impulse Response (IIR) filter, or by using an FFT.
In the described embodiment, bandpass filter 26 is implemented using a thirty two point real input FFT to compute the
outputs of a thirty two point FIR filter at seventeen frequencies, and achieves a downsampling factor of S by shifting
the input by S samples each time the FFT is computed. For example, if a first FFT used samples one through thirty
two, a downsampling factor of ten would be achieved by using samples eleven through forty two in a second FFT.
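
The sample ranges used by successive FFTs in the example above can be enumerated with a short Python sketch. The
function name fft_frame_bounds and the total of 52 input samples are assumptions made for illustration; sample
numbering starts at one, as in the example.

```python
def fft_frame_bounds(num_samples, fft_len=32, shift=10):
    """Return (first, last) sample numbers, counted from one, for each FFT."""
    bounds = []
    first = 1
    while first + fft_len - 1 <= num_samples:
        bounds.append((first, first + fft_len - 1))
        first += shift  # shifting by S samples gives a downsampling factor of S
    return bounds


frames = fft_frame_bounds(52)
```

For 52 input samples the sketch yields the ranges 1 to 32, 11 to 42 and 21 to 52.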

A first nonlinear operation unit 28 then performs a nonlinear operation on the isolated frequency band $s^i(n)$ to
emphasize the fundamental frequency of the isolated frequency band $s^i(n)$. For complex values of $s^i(n)$ (i
greater than zero), the absolute value, $|s^i(n)|$, is used. For the real value of $s^0(n)$, $s^0(n)$ is used if
$s^0(n)$ is greater than zero and zero is used if $s^0(n)$ is less than or equal to zero.
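
The two cases handled by nonlinear operation unit 28 can be sketched in Python as follows. The function name and the
representation of a band as a list of samples are assumptions made for the example.

```python
def nonlinear_operation(band_index, samples):
    """Apply the nonlinearity described for unit 28 to one band of samples."""
    if band_index == 0:
        # Real band: keep positive samples, replace the rest with zero.
        return [v if v > 0.0 else 0.0 for v in samples]
    # Complex bands (index greater than zero): absolute value.
    return [abs(v) for v in samples]


real_out = nonlinear_operation(0, [-1.0, 0.0, 2.5])
complex_out = nonlinear_operation(3, [3 + 4j, -2 + 0j])
```

The real band is thus half-wave rectified, while each complex band is replaced by its magnitude.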

The output of nonlinear operation unit 28 is passed through a lowpass filtering and downsampling unit 30 to reduce
the data rate and consequently reduce the computational requirements of later components of the system. Lowpass
filtering and downsampling unit 30 uses an FIR filter computed every other sample for a downsampling factor of two.

A windowing and FFT unit 32 multiplies the output of lowpass filtering and downsampling unit 30 by a window and
computes a real input FFT, $S^i(\omega)$, of the product. Typically, windowing and FFT unit 32 uses a Hamming window
and a real input FFT.

Finally, a second nonlinear operation unit 34 performs a nonlinear operation on $S^i(\omega)$ to facilitate
estimation of voiced or total energy and to ensure that the outputs of channel processing units 20, $T^i(\omega)$,
combine constructively if used in fundamental frequency estimation. The absolute value squared is used because it
makes all components of $T^i(\omega)$ real and positive.

With reference to Fig. 4, second parameter estimator 16 produces the second preliminary V/UV estimates using
a sinusoid detector/estimator. Channel processing units 36 in second parameter estimator 16 divide speech signal
s(n) into at least two frequency bands and process the frequency bands to produce a first set of signals, designated
as $R^0(l)$ .. $R^I(l)$. Channel processing units 36 are differentiated by the parameters of a bandpass filter used in
the first stage of each channel processing unit 36. In the described embodiment, there are sixteen channel processing
units (I equals 15). The number of channels (value of I) in Fig. 4 does not have to equal the number of channels
(value of I) in Fig. 2.

A remap unit 38 transforms the first set of signals to produce a second set of signals, designated as $S^0(l)$ ..
$S^K(l)$. The remap unit can be an identity system. In the described embodiment, there are eight signals in the second
set of signals (K equals 7). Thus, remap unit 38 maps the signals from the sixteen channel processing units 36 into
eight signals. Remap unit 38 does so by combining consecutive pairs of signals from the first set into single signals
in the second set. For example, $R^0(l)$ and $R^1(l)$ are combined to produce $S^0(l)$, and $R^{14}(l)$ and
$R^{15}(l)$ are combined to produce $S^7(l)$. Other approaches to remapping could also be used.

Next, V/UV parameter estimation units 40, each associated with a signal from the second set, produce preliminary
V/UV parameters $B^0$ to $B^K$ by computing a ratio of the sinusoidal energy in the signal to the total energy in the
signal and subtracting this ratio from 1:

$$B^k = 1.0 - S^k(1)/S^k(0).$$

With reference to Fig. 5, when speech signal s(n) enters a channel processing unit 36, components $s^i(n)$ belonging
to a particular frequency band are isolated by a bandpass filter 26 that operates identically to the bandpass filters
of channel processing units 20 (see Fig. 3). It should be noted that, to reduce computation requirements, the same
bandpass filters may be used in channel processing units 20 and 36, with the outputs of each filter being supplied to
a first nonlinear operation unit 28 of a channel processing unit 20 and a window and correlate unit 42 of a channel
processing unit 36.

A window and correlate unit 42 then produces two correlation values for the isolated frequency band $s^i(n)$. The
first value, $R^i(0)$, provides a measure of the total energy in the frequency band:

* J <o> = U^di)! 2 * \sUn + s) | 2 ] ] 2 

2 J3 = 0 



where N is related to the size of the window and typically defines an interval of 20 milliseconds, and S is the number
of samples by which the bandpass filter shifts the input speech samples. The second value, $R^i(1)$, provides a
measure of the sinusoidal energy in the frequency band:

n = Q 



Combination block 18 produces voiced/unvoiced parameters $V^0$ to $V^K$ by selecting the minimum of a preliminary
V/UV parameter from the first set and a function of a preliminary V/UV parameter from the second set. In particular,
combination block 18 produces the voiced/unvoiced parameters as:

$$V^k = \min\left(A^k,\ f_B(B^k)\right)$$

where

$$f_B(B^k) = B^k + \alpha(k)\,\beta(\omega_0),$$

$$\beta(\omega_0) = \begin{cases} 1.0, & \text{when } \omega_0 \ge 2\pi/60.0, \text{ or} \\ 2\pi/(60\,\omega_0), & \text{when } \omega_0 < 2\pi/60.0 \end{cases}$$

and $\alpha(k)$ is an increasing function of k. Because a preliminary V/UV parameter having a value close to zero has
a higher probability of being correct than a preliminary V/UV parameter having a larger value, the selection of the
minimum value results in the selection of the value that is most likely to be correct.
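
The operation of combination block 18 can be sketched in Python as follows. The description says only that
$\alpha(k)$ is an increasing function of k, so the linear function example_alpha below is purely hypothetical, as are
the function names and the input values.

```python
import math


def beta(w0):
    """Weighting that grows as w0 falls below 2*pi/60."""
    limit = 2.0 * math.pi / 60.0
    return 1.0 if w0 >= limit else 2.0 * math.pi / (60.0 * w0)


def combine_vuv(a_params, b_params, w0, alpha):
    """Per band, take the minimum of A^k and B^k + alpha(k) * beta(w0)."""
    return [
        min(a_k, b_k + alpha(k) * beta(w0))
        for k, (a_k, b_k) in enumerate(zip(a_params, b_params))
    ]


def example_alpha(k):
    """Hypothetical increasing function of the band index k."""
    return 0.05 * k


v_params = combine_vuv([0.3, 0.1], [0.1, 0.4], 0.2, example_alpha)
```

In this example the second-set value is selected for band 0 and the first-set value for band 1, each being the
smaller of the two candidates for its band.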

With reference to Fig. 6, in another embodiment, a first parameter estimator 14' produces the first preliminary
V/UV estimate using an autocorrelation domain approach. Channel processing units 44 in first parameter estimator 14'
divide speech signal s(n) into at least two frequency bands and process the frequency bands to produce a first set of
frequency band signals, designated as $T^0(l)$ .. $T^K(l)$. There are eight channel processing units (K equals 7) and
no remapping unit is necessary.

Next, voiced/unvoiced (V/UV) parameter estimation units 46, each associated with a channel processing unit 44,
produce preliminary V/UV parameters $A^0$ to $A^K$ by computing a ratio of the voiced energy in the frequency band at
an estimated pitch period $n_0$ to the total energy in the frequency band and subtracting this ratio from 1:

$$A^k = 1.0 - E_v^k(n_0)/E_t^k.$$

The voiced energy in the frequency band is computed as:

$$E_v^k(n_0) = C(n_0)\, T^k(n_0)$$

where

$$C(n_0) = \frac{1}{\sum_{n=0}^{N-1} w(n)\, w(n+n_0)},$$

N is the number of samples in the window and typically has a value of 101, and $C(n_0)$ compensates for the window
roll-off as a function of increasing autocorrelation lag. For non-integer values of $n_0$, the voiced energy at the
nearest three values of n is used with a parabolic interpolation method to obtain the voiced energy for $n_0$. The
total energy is determined as the voiced energy for $n_0$ equal to zero.
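
One way of carrying out the parabolic interpolation for a non-integer pitch period is sketched below in Python. A
parabola is fitted through the voiced energies at the three integer lags nearest the pitch period and evaluated at
the pitch period itself. The function name, the callable energy_at_lag and the toy energy function are assumptions;
the pitch period is assumed to be at least one sample.

```python
import math


def interpolate_voiced_energy(energy_at_lag, n0):
    """Evaluate a parabola through the three integer lags nearest n0 at n0."""
    center = int(math.floor(n0 + 0.5))  # nearest integer lag
    d = n0 - center                     # fractional offset from that lag
    e_minus = energy_at_lag(center - 1)
    e_zero = energy_at_lag(center)
    e_plus = energy_at_lag(center + 1)
    return (e_zero
            + 0.5 * d * (e_plus - e_minus)
            + 0.5 * d * d * (e_plus - 2.0 * e_zero + e_minus))


def toy_energy(lag):
    """Hypothetical quadratic energy function used to check the fit."""
    return 2.0 * lag * lag - 3.0 * lag + 1.0


value = interpolate_voiced_energy(toy_energy, 10.3)
```

Because the toy energy function is itself quadratic, the interpolated value coincides with the function evaluated
directly at a lag of 10.3; for measured voiced energies the result is an approximation.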

With reference to Fig. 7, when speech signal s(n) enters a channel processing unit 44, components $s^i(n)$ belonging
to a particular frequency band are isolated by a bandpass filter 48. Bandpass filter 48 uses downsampling to reduce
computational requirements, and does so without any significant impact on system performance. Bandpass filter 48
can be implemented as a Finite Impulse Response (FIR) or Infinite Impulse Response (IIR) filter, or by using an FFT.
A downsampling factor of S is achieved by shifting the input speech samples by S each time the filter outputs are
computed.

A nonlinear operation unit 50 then performs a nonlinear operation on the isolated frequency band $s^i(n)$ to
emphasize the fundamental frequency of the isolated frequency band $s^i(n)$. For complex values of $s^i(n)$ (i
greater than zero), the absolute value, $|s^i(n)|$, is used. For the real value of $s^0(n)$, no nonlinear operation
is performed.

The output of nonlinear operation unit 50 is passed through a highpass filter 52, and the output of the highpass 
filter is passed through an autocorrelation unit 54. A 101 point window is used, and, to reduce computation, the auto- 
correlation is only computed at a few samples nearest the pitch period. 

With reference again to Fig. 4, second parameter estimator 16 may also use other approaches to produce the
second voiced/unvoiced estimate. For example, well-known techniques such as using the height of the peak of the
cepstrum, using the height of the peak of the autocorrelation of a linear prediction coder residual, MBE model
parameter estimation methods, or IMBE (TM) model parameter estimation methods may be used. In addition, with
reference again to Fig. 5, window and correlate unit 42 may produce autocorrelation values for the isolated frequency
band $s^i(n)$ as:

$$R^i(l) = \mathrm{Re}\left[\sum_{n} s^i(n+l)\, w(n+l)\, s^{i*}(n)\, w(n)\right]$$

where w(n) is the window. With this approach, combination block 18 produces the voiced/unvoiced parameters as:

$$V^k = \min(A^k,\ B^k).$$

The fundamental frequency may be estimated using a number of approaches. First, with reference to Fig. 8, a
fundamental frequency estimation unit 56 includes a combining unit 58 and an estimator 60. Combining unit 58 sums
the $T^i(\omega)$ outputs of channel processing units 20 (Fig. 2) to produce $X(\omega)$. In an alternative approach,
combining unit 58 could estimate a signal-to-noise ratio (SNR) for the output of each channel processing unit 20 and
weight the various outputs so that an output with a higher SNR contributes more to $X(\omega)$ than does an output
with a lower SNR.

Estimator 60 then estimates the fundamental frequency ($\omega_0$) by selecting a value for $\omega_0$ that maximizes
$X(\omega_0)$ over an interval from $\omega_{min}$ to $\omega_{max}$. Since $X(\omega)$ is only available at discrete
samples of $\omega$, parabolic interpolation of $X(\omega_0)$ near $\omega_0$ is used to improve accuracy of the
estimate. Estimator 60 further improves the accuracy of the fundamental estimate by combining parabolic estimates
near the peaks of the N harmonics of $\omega_0$ within the bandwidth of $X(\omega)$.

Once an estimate of the fundamental frequency is determined, the voiced energy $E_v(\omega_0)$ is computed as:

$$E_v(\omega_0) = \sum_{n=1}^{N} \sum_{\omega_m \in I_n} X(\omega_m)$$

where

$$I_n = [(n-0.25)\,\omega_0,\ (n+0.25)\,\omega_0].$$

Thereafter, the voiced energy $E_v(0.5\omega_0)$ is computed and compared to $E_v(\omega_0)$ to select between
$\omega_0$ and $0.5\omega_0$ as the final estimate of the fundamental frequency.

With reference to Fig. 9, an alternative fundamental frequency estimation unit 62 includes a nonlinear operation
unit 64, a windowing and Fast Fourier Transform (FFT) unit 66, and an estimator 68. Nonlinear operation unit 64
performs a nonlinear operation, the absolute value squared, on s(n) to emphasize the fundamental frequency of s(n)
and to facilitate determination of the voiced energy when estimating $\omega_0$.

Windowing and FFT unit 66 multiplies the output of nonlinear operation unit 64 by a window to segment it and computes
an FFT, $X(\omega)$, of the resulting product. Finally, estimator 68, which works identically to estimator 60,
generates an estimate of the fundamental frequency.

With reference to Fig. 10, a hybrid fundamental frequency estimation unit 70 includes a band combination and
estimation unit 72, an IMBE estimation unit 74 and an estimate combination unit 76. Band combination and estimation
unit 72 combines the outputs of channel processing units 20 (Fig. 2) using simple summation or a signal-to-noise ratio
(SNR) weighting where bands with higher SNRs are given higher weight in the combination. From the combined signal
($U(\omega)$), unit 72 estimates a fundamental frequency and a probability that the fundamental frequency is correct.
Unit 72 estimates the fundamental frequency by choosing the frequency that maximizes the voiced energy
($E_v(\omega_0)$) from the combined signal, which is determined as:

$$E_v(\omega_0) = \sum_{n=1}^{N} \sum_{\omega_m \in I_n} U(\omega_m)$$

where

$$I_n = [(n-0.25)\,\omega_0,\ (n+0.25)\,\omega_0],$$

and N is the number of harmonics of the fundamental frequency. The probability that $\omega_0$ is correct is estimated
by comparing $E_v(\omega_0)$ to the total energy $E_t$, which is computed as:



$$E_t = \sum_{\omega_m > 0.5\,\omega_0} U(\omega_m)$$



When $E_v(\omega_0)$ is close to $E_t$, the probability estimate is near one. When $E_v(\omega_0)$ is close to one
half of $E_t$, the probability estimate is near zero.

IMBE estimation unit 74 uses the well known IMBE technique, or a similar technique, to produce a second fundamental
frequency estimate and probability of correctness. Thereafter, estimate combination unit 76 combines the two
fundamental frequency estimates to produce the final fundamental frequency estimate. The probabilities of correctness
are used so that the estimate with higher probability of correctness is selected or given the most weight.
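
The two options mentioned for estimate combination unit 76, selection and weighting, can be sketched in Python as
follows. The function name, the select flag, the probability-weighted average and the fallback for zero probabilities
are assumptions; the description does not specify the weighting rule.

```python
def combine_estimates(est_a, prob_a, est_b, prob_b, select=True):
    """Combine two fundamental frequency estimates using their probabilities."""
    if select:
        # Selection: keep the estimate more likely to be correct.
        return est_a if prob_a >= prob_b else est_b
    total = prob_a + prob_b
    if total == 0.0:
        return 0.5 * (est_a + est_b)  # assumed fallback when both are zero
    # Weighting (assumed form): probability-weighted average.
    return (prob_a * est_a + prob_b * est_b) / total


selected = combine_estimates(0.10, 0.9, 0.20, 0.3)
weighted = combine_estimates(0.10, 0.9, 0.20, 0.3, select=False)
```

Selection may be preferable when the two estimates differ by a factor of two, since an average of an estimate and
its multiple would correspond to neither candidate.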

With reference to Fig. 11, a voiced/unvoiced parameter smoothing unit 78 performs a smoothing operation to
remove voicing errors that might result from rapid transitions in the speech signal. Unit 78 produces a smoothed
voiced/unvoiced parameter as:

$$v_s^k(n) = \begin{cases} 1.0, & \text{when } v^k(n-1)\, v^k(n+1) = 1, \text{ and} \\ v^k(n), & \text{otherwise} \end{cases}$$

where the voiced/unvoiced parameters equal zero for unvoiced speech and one for voiced speech. When the
voiced/unvoiced parameters have continuous values, with a value near zero corresponding to highly voiced speech, unit
78 produces a smoothed voiced/unvoiced parameter that is smoothed in both the time and frequency domains:

$$v_s^k(n) = \lambda^k(n)\, \min\left(v^k(n),\ \alpha^k(n),\ \beta^k(n),\ \gamma^k(n)\right)$$

where

$$\alpha^k(n) = \begin{cases} 2v^{k+1}(n), & \text{when } k = 0, 1, \ldots, K-1, \text{ or} \\ \infty, & \text{when } k = K; \end{cases}$$

$$\beta^k(n) = \begin{cases} 2v^{k-1}(n), & \text{when } k = 2, 3, \ldots, K, \text{ or} \\ \infty, & \text{when } k = 0, 1; \end{cases}$$

$$\gamma^k(n) = \begin{cases} 0.25\,v^{k-1}(n) + 0.5\,v^k(n) + 0.25\,v^{k+1}(n), & \text{when } k = 1, 2, \ldots, K-1, \text{ or} \\ \infty, & \text{when } k = 0, K; \end{cases}$$

$$\lambda^k(n) = \begin{cases} 0.8, & \text{when } v^k(n-1) < T^k(n-1) \text{ and } |\omega_0(n) - \omega_0(n-1)| < 0.25\,|\omega_0(n)|, \text{ or} \\ 1, & \text{otherwise;} \end{cases}$$

and $T^k(n)$ is a threshold value that is a function of time and frequency.
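
For the binary case, in which a parameter of one denotes voiced speech and zero denotes unvoiced speech, the rule
applied by unit 78 can be sketched in Python as follows. The sketch reads the rule as setting a band to voiced
whenever the same band is voiced in both the preceding and the following frame; the function name and the list
representation of a frame are assumptions.

```python
def smooth_binary_vuv(prev_frame, curr_frame, next_frame):
    """Smooth binary V/UV decisions over time, one value per band.

    A band is set to voiced (1) when it is voiced in both neighbouring
    frames; otherwise the current decision is kept.
    """
    return [
        1 if (p == 1 and nx == 1) else c
        for p, c, nx in zip(prev_frame, curr_frame, next_frame)
    ]


smoothed = smooth_binary_vuv([1, 1, 0], [0, 0, 0], [1, 0, 1])
```

Only the first band is changed in this example, because it is the only band that is voiced in both neighbouring
frames. Since the rule uses the following frame, it implies a delay of one frame.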

With reference to Fig. 12, a voiced/unvoiced parameter improvement unit 80 produces improved voiced/unvoiced
parameters by comparing the voiced/unvoiced parameter produced when the estimated fundamental frequency equals
$\omega_0$ to a voiced/unvoiced parameter produced when the estimated fundamental frequency equals one half of
$\omega_0$ and selecting the parameter having the lower value. In particular, voiced/unvoiced parameter improvement
unit 80 produces improved voiced/unvoiced parameters as:

$$A^k(\omega_0) = \min\left(A^k(\omega_0),\ A^k(0.5\,\omega_0)\right)$$

where

$$A^k(\omega) = 1.0 - E_v^k(\omega)/E_t^k.$$

With reference to Fig. 13, an improved estimate of the fundamental frequency ($\omega_0$) is generated according to a
procedure 100. The initial fundamental frequency estimate ($\hat{\omega}_0$) is generated according to one of the
procedures described above and is used in step 101 to generate a set of evaluation frequencies $\tilde{\omega}_k$. The
evaluation frequencies are typically chosen to be near the integer submultiples and multiples of $\hat{\omega}_0$.
Thereafter, functions are evaluated at this set of evaluation frequencies (step 102). The functions that are
evaluated typically consist of the voiced energy function $E_v(\tilde{\omega}_k)$ and the normalized frame error
$E_f(\tilde{\omega}_k)$. The normalized frame error is computed as:

$$E_f(\tilde{\omega}_k) = 1.0 - E_v(\tilde{\omega}_k)/E_t(\tilde{\omega}_k).$$

The final fundamental frequency estimate is then selected (step 103) using the evaluation frequencies, the function
values at the evaluation frequencies, the predicted fundamental frequency (described below), the final fundamental
frequency estimates from previous frames, and the above function values from previous frames. When these inputs
indicate that one evaluation frequency has a much higher probability of being the correct fundamental frequency than
the others, then it is chosen. Otherwise, if two evaluation frequencies have similar probability of being correct and
the normalized error for the previous frame is relatively low, then the evaluation frequency closest to the final
fundamental frequency from the previous frame is chosen. Otherwise, if two evaluation frequencies have similar
probability of being correct, then the one closest to the predicted fundamental frequency is chosen. The predicted
fundamental frequency for the next frame is generated (step 104) using the final fundamental frequency estimates from
the current and previous frames, a delta fundamental frequency, and normalized frame errors computed at the final
fundamental frequency estimate for the current frame and previous frames. The delta fundamental frequency is computed
from the frame to frame difference in the final fundamental frequency estimate when the normalized frame errors for
these frames are relatively low and the percentage change in fundamental frequency is low; otherwise, it is computed
from previous values. When the normalized error for the current frame is relatively low, the predicted fundamental
for the current frame is set to the final fundamental frequency. The predicted fundamental for the next frame is set
to the sum of the predicted fundamental for the current frame and the delta fundamental frequency for the current
frame.

Claims 

1. A method of analysing a digitized speech signal to determine excitation parameters for the digitized speech signal, 
preferably as a step in encoding speech, the method comprising dividing the digitized speech signal into one or 
more frequency band signals; and, preferably at regular intervals of time, performing the further step of: determining 
a first preliminary excitation parameter using a first method that includes performing a nonlinear operation on at 
least one of the frequency band signals to produce at least one modified frequency band signal and determining 
the first preliminary excitation parameter using the at least one modified frequency band signal; determining at 
least a second preliminary excitation parameter using at least a second method different from the said first method;
and using the first and at least a second preliminary excitation parameters to determine an excitation parameter
for the digitized speech signal.

2. A method according to Claim 1, wherein at least one of the second methods uses at least one of the frequency
band signals without performing the said nonlinear operation. 

3. A method according to Claims 1 or 2, wherein the excitation parameter comprises a voiced/unvoiced parameter 
for at least one frequency band, said parameter preferably having values that vary over a continuous range. 

4. A method according to any preceding claim, further comprising determining a fundamental frequency for the dig- 
itized speech signal. 

5. A method according to Claim 3, wherein the first preliminary excitation parameter comprises a first voiced/unvoiced 
parameter for the at least one modified frequency band signal, and wherein the first determining step includes 
determining the first voiced/unvoiced parameter by comparing voiced energy in the modified frequency band signal 
to total energy in the modified frequency band signal. 

6. A method according to Claim 5, wherein the voiced energy in the modified frequency band signal corresponds to
the energy associated with an estimated fundamental frequency for the digitized speech signal.

7. A method according to Claim 5, wherein the voiced energy in the modified frequency band signal corresponds to
the energy associated with an estimated pitch period for the digitized speech signal. 

8. A method according to Claim 5, wherein the second preliminary excitation parameter includes a second voiced/ 
unvoiced parameter for the at least one frequency band signal, and wherein the second determining step includes 
determining the second voiced/unvoiced parameter by comparing sinusoidal energy in the at least one frequency 
band signal to total energy in the at least one frequency band signal. 

9. A method according to Claim 5, wherein the second preliminary excitation parameter includes a second voiced/ 
unvoiced parameter for the at least one frequency band signal, and wherein the second determining step includes 
determining the second voiced/unvoiced parameter by autocorrelating the at least one frequency band signal. 

10. A method according to any preceding claim, wherein the said using step emphasizes the first preliminary excitation
parameter over the second preliminary excitation parameter in determining the excitation parameter for the digitized 
speech signal when the first preliminary excitation parameter has a higher probability of being correct than does 
the second preliminary excitation parameter. 
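
By way of illustration only, one way of emphasizing the preliminary parameter that is more likely to be correct, as recited in Claim 10, is a weighted average. The function name `combine_preliminary` and the probability arguments `p_first` and `p_second` are hypothetical; the claim does not specify how the probabilities are obtained or how the emphasis is applied.

```python
def combine_preliminary(first, second, p_first, p_second):
    """Weight each preliminary parameter by its assumed probability of being correct."""
    total = p_first + p_second
    if total == 0.0:
        # No confidence information: fall back to a plain average.
        return 0.5 * (first + second)
    return (p_first * first + p_second * second) / total
```

With `p_first` larger than `p_second`, the result lies closer to `first` than to `second`.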

11. A method according to any preceding claim, further comprising smoothing the excitation parameter to produce a 
smoothed excitation parameter. 

12. A method of analysing a digitized speech signal to determine excitation parameters for the digitized speech signal, 
preferably as a step in encoding speech, the method comprising the steps of: determining preliminary excitation 
parameters from the digitized speech signal; and smoothing the preliminary excitation parameters to produce 
excitation parameters. 

13. A method according to Claim 12, wherein the preliminary excitation parameters include a preliminary voiced/un- 
voiced parameter for at least one frequency band and the excitation parameters include a voiced/unvoiced pa- 
rameter for at least one frequency band, which voiced/unvoiced parameter preferably has values that vary over a 
continuous range. 

14. A method according to Claim 13, wherein the excitation parameters include a fundamental frequency. 

15. A method according to Claims 13 or 14, wherein the smoothing step makes the voiced/unvoiced parameter more 
voiced than the preliminary voiced/unvoiced parameter when voiced/unvoiced parameters that are nearby in time 
and/or frequency are voiced. 

16. A method according to Claim 12, wherein the smoothing step is performed as a function of time and/or frequency.
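
By way of illustration only, smoothing over time and frequency of the kind recited in Claims 15 and 16 might be sketched as follows. The function name `smooth_vuv`, the weight `alpha`, the choice of neighbours (the same band in the previous frame and the adjacent bands in the current frame), and the convention that larger values are more voiced are assumptions made for this sketch.

```python
def smooth_vuv(prelim, prev, alpha=0.5):
    """Raise each band's voiced/unvoiced value toward the level of its neighbours.

    prelim: preliminary values for the current frame, one per band, in [0, 1].
    prev:   values for the previous frame, one per band.
    """
    smoothed = []
    for k, value in enumerate(prelim):
        neighbours = [prev[k]]                 # nearby in time
        if k > 0:
            neighbours.append(prelim[k - 1])   # nearby in frequency (lower band)
        if k + 1 < len(prelim):
            neighbours.append(prelim[k + 1])   # nearby in frequency (upper band)
        support = sum(neighbours) / len(neighbours)
        # Never less voiced than the preliminary value.
        smoothed.append(max(value, alpha * support))
    return smoothed
```

In this sketch a weakly voiced band surrounded by voiced neighbours is made more voiced, while a band whose neighbours are unvoiced keeps its preliminary value.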

17. A method of analysing a digitized speech signal to determine excitation parameters for the digitized speech signal,
preferably as a step in encoding speech, the method comprising the steps of: estimating a fundamental frequency
for the digitized speech signal; evaluating a voiced/unvoiced function using the estimated fundamental frequency
to produce a first preliminary voiced/unvoiced parameter; evaluating the voiced/unvoiced function using at least
one other frequency derived from the estimated fundamental frequency to produce at least one other preliminary
voiced/unvoiced parameter; and combining the first and at least one other preliminary voiced/unvoiced parameters
to produce a voiced/unvoiced parameter.

18. A method according to Claim 17, wherein the said at least one other frequency is derived from the said estimated
fundamental frequency as a multiple or submultiple of the said estimated fundamental frequency.

19. A method according to Claim 17, wherein the combining step includes choosing the first preliminary
voiced/unvoiced parameter as the voiced/unvoiced parameter when the first preliminary voiced/unvoiced parameter
indicates that the digitized speech signal is more voiced than does the second preliminary voiced/unvoiced parameter.
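
By way of illustration only, the evaluation and combining steps of Claims 17 to 19 might be sketched as follows. The function name `combined_vuv`, the callable `vuv_function`, the particular factors of one half and two, and the convention that a larger value means more voiced are assumptions made for this sketch.

```python
def combined_vuv(vuv_function, f0_hz, factors=(0.5, 2.0)):
    """Evaluate a voiced/unvoiced function at f0 and at derived frequencies.

    vuv_function: callable mapping a frequency in Hz to a voicing value,
                  where a larger value is assumed to mean more voiced.
    factors:      multiples and submultiples of f0 to try (assumed values).
    """
    first = vuv_function(f0_hz)
    others = [vuv_function(f0_hz * m) for m in factors]
    # Keep whichever preliminary parameter indicates the most voicing.
    return max([first] + others)
```

If the estimated fundamental frequency is twice the true value, evaluating at the submultiple recovers the voicing that the first evaluation misses.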

20. A method of synthesizing speech using excitation parameters, where the excitation parameters are estimated by 
using a method for determining such parameters according to any preceding claim. 

21. A method of analysing a digitized speech signal to determine a fundamental frequency estimate for the digitized 
speech signal, comprising the steps of: determining a predicted fundamental frequency estimate from previous 
fundamental frequency estimates; determining an initial fundamental frequency estimate; evaluating an error
function at the initial fundamental frequency estimate to produce a first error function value; evaluating the error
function at at least one other frequency derived from the initial fundamental frequency estimate to produce at least
one other error function value; and selecting a fundamental frequency estimate using the predicted fundamental
frequency estimate, the initial fundamental frequency estimate, the first error function value, and the at least one
other error function value.



22. A method according to Claim 21, wherein the said at least one other frequency is derived from the said estimated
fundamental frequency as a multiple or submultiple of the said estimated fundamental frequency. 
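
By way of illustration only, the selecting step of Claim 21, with candidate frequencies derived as in Claim 22, might be sketched as follows. The function name `select_fundamental`, the callable `error_function`, the factors of one half and two, and the cost that adds a weighted relative deviation from the predicted estimate are assumptions made for this sketch; the claims do not specify how the four quantities are combined.

```python
def select_fundamental(error_function, initial_hz, predicted_hz,
                       factors=(0.5, 2.0), weight=0.5):
    """Pick a fundamental frequency from the initial estimate and derived candidates.

    error_function: callable mapping a frequency in Hz to an error value.
    weight:         hypothetical penalty on deviation from the prediction.
    """
    candidates = [initial_hz] + [initial_hz * m for m in factors]

    def cost(f_hz):
        # Error at the candidate plus a penalty for straying from the prediction.
        return error_function(f_hz) + weight * abs(f_hz - predicted_hz) / predicted_hz

    return min(candidates, key=cost)
```

In this sketch, when the initial estimate and its submultiple have similar error values, the candidate closer to the predicted estimate is selected.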

23. A method according to Claim 21, wherein the predicted fundamental frequency is determined by adding a delta 
factor to a previous predicted fundamental frequency, which delta factor is preferably determined from previous 
first and at least one other error function values, the previous predicted fundamental frequency, and a previous 
delta factor. 

24. A method of synthesizing speech using a fundamental frequency, where the fundamental frequency is estimated 
using a method according to any of Claims 21, 22 or 23.

25. A system for analysing a digitized speech signal to determine excitation parameters for the digitized speech signal, 
comprising: means for dividing the digitized speech signal into one or more frequency band signals; means for
determining a first preliminary excitation parameter using a first method that includes performing a nonlinear op- 
eration on at least one of the frequency band signals to produce at least one modified frequency band signal and 
determining the first preliminary excitation parameter using the at least one modified frequency band signal; means 
for determining a second preliminary excitation parameter using a second method that is different from the above 
said first method; and means for using the first and second preliminary excitation parameters to determine an 
excitation parameter for the digitized speech signal. 



26. A system for analysing a digitized speech signal to determine excitation parameters for the digitized speech signal, 
comprising: means for determining preliminary excitation parameters from the digitized speech signal; and means
for smoothing the preliminary excitation parameters to produce excitation parameters. 

27. A system for analysing a digitized speech signal to determine modified excitation parameters for the digitized 
speech signal, comprising: means for estimating a fundamental frequency for the digitized speech signal; means
for evaluating a voiced/unvoiced function using the estimated fundamental frequency to produce a first preliminary 
voiced/unvoiced parameter; means for evaluating the voiced/unvoiced function using another frequency derived 
from the estimated fundamental frequency to produce a second preliminary voiced/unvoiced parameter; and 
means for combining the first and second preliminary voiced/unvoiced parameters to produce a voiced/unvoiced
parameter. 

28. A system for analysing a digitized speech signal to determine a fundamental frequency estimate for the digitized
speech signal, comprising: means for determining a predicted fundamental frequency estimate from previous
fundamental frequency estimates; means for determining an initial fundamental frequency estimate; means for
evaluating an error function at the initial fundamental frequency estimate to produce a first error function value; means
for evaluating the error function at at least one other frequency derived from the initial fundamental frequency 
estimate to produce a second error function value; and means for selecting a fundamental frequency estimate 
using the predicted fundamental frequency estimate, the initial fundamental frequency estimate, the first error 
function value, and the second error function value. 

29. A method of analysing a digitized speech signal to determine a voiced/unvoiced function for the digitized speech 
signal, comprising: dividing the digitized speech signal into at least two frequency band signals; determining a first 
preliminary voiced/unvoiced function for at least two of the frequency band signals using a first method; determining 
a second preliminary voiced/unvoiced function for at least two of the frequency band signals using a second method 
which is different from the above said first method; and using the first and second preliminary excitation parameters 
to determine a voiced/unvoiced function for at least two of the frequency band signals. 